{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0f28ca17",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Training the Neural Network"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9d322f5",
   "metadata": {},
   "source": [
    "$\\def\\abs#1{\\left\\lvert #1 \\right\\rvert}\n",
    "\\def\\Set#1{\\left\\{ #1 \\right\\}}\n",
    "\\def\\mc#1{\\mathcal{#1}}\n",
    "\\def\\M#1{\\boldsymbol{#1}}\n",
    "\\def\\R#1{\\mathsf{#1}}\n",
    "\\def\\RM#1{\\boldsymbol{\\mathsf{#1}}}\n",
    "\\def\\op#1{\\operatorname{#1}}\n",
    "\\def\\E{\\op{E}}\n",
    "\\def\\d{\\mathrm{\\mathstrut d}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03b11db3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# init\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import tensorboard as tb\n",
    "import torch\n",
    "import torch.optim as optim\n",
    "from torch import Tensor, nn\n",
    "from torch.nn import functional as F\n",
    "from torch.utils.tensorboard import SummaryWriter\n",
    "\n",
    "%load_ext tensorboard\n",
    "%load_ext jdc\n",
    "%matplotlib inline\n",
    "\n",
    "SEED = 0\n",
    "\n",
    "# create samples\n",
    "XY_rng = np.random.default_rng(SEED)\n",
    "rho = 1 - 0.19 * XY_rng.random()\n",
    "mean, cov, n = [0, 0], [[1, rho], [rho, 1]], 1000\n",
    "XY = XY_rng.multivariate_normal(mean, cov, n)\n",
    "\n",
    "XY_ref_rng = np.random.default_rng(SEED)\n",
    "cov_ref, n_ = [[1, 0], [0, 1]], n\n",
    "XY_ref = XY_ref_rng.multivariate_normal(mean, cov_ref, n_)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92f2e864",
   "metadata": {},
   "source": [
    "We will train a neural network with `torch` and use GPU if available:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce228ed2",
   "metadata": {},
   "outputs": [],
   "source": [
    "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "\n",
    "if DEVICE == \"cuda\":  # print current GPU name if available\n",
    "    print(\"Using GPU:\", torch.cuda.get_device_name(torch.cuda.current_device()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6028bb91",
   "metadata": {},
   "source": [
    "When GPU is available, you can use [GPU dashboards][gpu] on the left to monitor GPU utilizations.\n",
    "\n",
    "[gpu]: https://github.com/rapidsai/jupyterlab-nvdashboard"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39d4beb4",
   "metadata": {},
   "source": [
    "![GPU](images/gpu.dio.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a226c2f",
   "metadata": {},
   "source": [
    "**How to train a neural network by gradient descent?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21170c5d",
   "metadata": {},
   "source": [
    "We will first consider a simple implementation followed by a more practical implementation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e38f9b4",
   "metadata": {},
   "source": [
    "## A simple implementation of gradient descent"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed2d1a05",
   "metadata": {},
   "source": [
    "Consider solving for a given $z\\in \\mathbb{R}$,\n",
    "\n",
    "$$ \\inf_{w\\in \\mathbb{R}} \\overbrace{e^{w\\cdot z}}^{L(w):=}.$$\n",
    "\n",
    "We will train one parameter, namely, $w$, to minimize the loss $L(w)$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "954e4523",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "What is the solution for $z=-1$?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7785a288",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "eg-min",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "````{toggle}\n",
    "**Solution**\n",
    "\n",
    "With $z=-1$,\n",
    "\n",
    "$$\n",
    "L(w) = e^{-w} \\geq 0\n",
    "$$\n",
    "\n",
    "which is achievable with equality as $w\\to \\infty$.\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c04146d",
   "metadata": {},
   "source": [
    "**How to implement the loss function?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4385788f",
   "metadata": {},
   "source": [
    "We will define the loss function using tensors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fd10f70",
   "metadata": {},
   "outputs": [],
   "source": [
    "z = Tensor([-1]).to(DEVICE)  # default tensor type on a designated device\n",
    "\n",
    "\n",
    "def L(w):\n",
    "    return (w * z).exp()\n",
    "\n",
    "\n",
    "L(float(\"inf\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55a3788e",
   "metadata": {},
   "source": [
    "The function `L` is vectorized because `Tensor` operations follow the [broadcasting rules of `numpy`](https://numpy.org/doc/stable/user/basics.broadcasting.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f364918",
   "metadata": {},
   "outputs": [],
   "source": [
    "ww = np.linspace(0, 10, 100)\n",
    "ax = sns.lineplot(\n",
    "    x=ww,\n",
    "    y=L(Tensor(ww).to(DEVICE)).cpu().numpy(),  # convert to numpy array for plotting\n",
    ")\n",
    "ax.set(xlabel=r\"$w$\", title=r\"$L(w)=e^{-w}$\")\n",
    "ax.axhline(L(float(\"inf\")), ls=\"--\", c=\"r\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c92ad90",
   "metadata": {},
   "source": [
    "**What is gradient descent?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3133179",
   "metadata": {},
   "source": [
    "A gradient descent algorithm updates the parameter $w$ iteratively starting with some initial $w_0$:\n",
    "\n",
    "$$w_{i+1} = w_i - s_i \\nabla L(w_i) \\qquad \\text{for }i\\geq 0,$$\n",
    "\n",
    "where $s_i$ is the *learning rate* (*step size*)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa759e7c",
   "metadata": {},
   "source": [
    "**How to compute the gradient?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17585554",
   "metadata": {},
   "source": [
    "With $w=0$, \n",
    "\n",
    "$$\\nabla L(0) = \\left.-e^{-w}\\right|_{w=0}=-1,$$ \n",
    "\n",
    "which can be computed using `backward` ([backpropagation][bp]):\n",
    "\n",
    "[bp]: https://en.wikipedia.org/wiki/Backpropagation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9eb7c2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "w = Tensor([0]).to(DEVICE).requires_grad_()  # requires gradient calculation for w\n",
    "L(w).backward()  # calculate the gradient by backpropagation\n",
    "w.grad"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2973b1fc",
   "metadata": {},
   "source": [
    "Under the hood, the function call `L(w)` \n",
    "\n",
    "- not only return the loss function evaluated at `w`, but also\n",
    "- updates a computational graph for calculating the gradient since `w` `requires_grad_()`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64ad513e",
   "metadata": {},
   "source": [
    "**How to implement the gradient descent?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d2c2ca2",
   "metadata": {},
   "source": [
    "With a learning rate of `0.001`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d42ae10",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(1000):\n",
    "    w.grad = None  # zero the gradient to avoid accumulation\n",
    "    L(w).backward()\n",
    "    with torch.no_grad():  # updates the weights in place without gradient calculation\n",
    "        w -= w.grad * 1e-3\n",
    "\n",
    "print(\"w:\", w.item(), \"\\nL(w):\", L(w).item())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f5d6e25",
   "metadata": {},
   "source": [
    "**What is `torch.no_grad()`?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c8ee08e",
   "metadata": {},
   "source": [
    "It sets up a context where the computational graph will not be updated. In particular,\n",
    "\n",
    "```Python\n",
    "w -= w.grad * 1e-3\n",
    "```\n",
    "\n",
    "should not be differentiated in the subsequent calculations of the gradient.\n",
    "\n",
    "[no_grad]: https://pytorch.org/docs/stable/generated/torch.no_grad.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bfb301e",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "Repeatedly run the above cell until you get `L(w)` below `0.001`. How large is the value of `w`? What is the limitations of the simple gradient descent algorithm?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13905c71",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "gd-limitations",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "````{toggle}\n",
    "**Solution** \n",
    "\n",
    "The value of `w` needs to be smaller than `6.9`. The convergence can be slow, especially when the learning rate is small. Also, `w` can be far away from its optimal value even if `L(w)` is close to its minimum.\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8d01a61",
   "metadata": {},
   "source": [
    "## A practical implementation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34c6e040",
   "metadata": {},
   "source": [
    "For a neural network to approximate a sophisticated function, it should have many parameters (*degrees of freedom*)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f28bdf5d",
   "metadata": {},
   "source": [
    "**How to define a neural network?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30714ea4",
   "metadata": {},
   "source": [
    "The following code [defines a simple neural network][define] with 3 fully-connected (fc) hidden layers:\n",
    "\n",
    "![Neural net](images/nn.dio.svg) \n",
    "\n",
    "where \n",
    "\n",
    "- $\\M{W}_l$ and $\\M{b}_l$ are the weight and bias respectively for the linear transformation $\\M{W}_l \\M{a}_l + \\M{b}_l$ of the $l$-th layer; and\n",
    "- $\\sigma$ for the first 2 hidden layers is an activation function called the [*exponential linear unit (ELU)*](https://pytorch.org/docs/stable/generated/torch.nn.ELU.html).\n",
    "\n",
    "[define]: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#define-the-network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f192de8",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Net(nn.Module):\n",
    "    def __init__(self, input_size=2, hidden_size=100, sigma=0.02):\n",
    "        super().__init__()\n",
    "        self.fc1 = nn.Linear(input_size, hidden_size)  # fully-connected (fc) layer\n",
    "        self.fc2 = nn.Linear(hidden_size, hidden_size)  # layer 2\n",
    "        self.fc3 = nn.Linear(hidden_size, 1)  # layer 3\n",
    "        nn.init.normal_(self.fc1.weight, std=sigma)  #\n",
    "        nn.init.constant_(self.fc1.bias, 0)\n",
    "        nn.init.normal_(self.fc2.weight, std=sigma)\n",
    "        nn.init.constant_(self.fc2.bias, 0)\n",
    "        nn.init.normal_(self.fc3.weight, std=sigma)\n",
    "        nn.init.constant_(self.fc3.bias, 0)\n",
    "\n",
    "    def forward(self, z):\n",
    "        a1 = F.elu(self.fc1(z))\n",
    "        a2 = F.elu(self.fc2(a1))\n",
    "        t = self.fc3(a2)\n",
    "        return t\n",
    "\n",
    "\n",
    "torch.manual_seed(SEED)  # seed RNG for PyTorch\n",
    "net = Net().to(DEVICE)\n",
    "print(net)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfba0ab1",
   "metadata": {},
   "source": [
    "The neural network is also a vectorized function. E.g., the following call `net` once to plots the density estimate of all $t(\\R{Z}_i)$'s and $t(\\R{Z}'_i)$'s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a026231c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "Z = Tensor(XY).to(DEVICE)\n",
    "Z_ref = Tensor(XY_ref).to(DEVICE)\n",
    "\n",
    "tZ = (\n",
    "    net(torch.cat((Z, Z_ref), dim=0))  # compute t(Z_i)'s and t(Z'_i)\n",
    "    # output needs to be converted back to an array on CPU for plotting\n",
    "    .cpu()  # copy back to CPU\n",
    "    .detach()  # detach from current graph (no gradient calculation)\n",
    "    .numpy()  # convert output back to numpy\n",
    ")\n",
    "\n",
    "tZ_df = pd.DataFrame(data=tZ, columns=[\"t\"])\n",
    "sns.kdeplot(data=tZ_df, x=\"t\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ea1a053",
   "metadata": {},
   "source": [
    "For 2D sample $(x,y)\\in \\mc{Z}$, we can plot the neural network $t(x,y)$ as a heatmap. The following code adds a method `plot` to `Net` using [`jdc`](https://alexhagen.github.io/jdc):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6b07df7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%add_to Net\n",
    "def plot(net, xmin=-5, xmax=5, ymin=-5, ymax=5, xgrids=50, ygrids=50, ax=None):\n",
    "    \"\"\"Plot a heat map of a neural network net. net can only have two inputs.\"\"\"\n",
    "    x, y = np.mgrid[xmin : xmax : xgrids * 1j, ymin : ymax : ygrids * 1j]\n",
    "    xy = np.concatenate((x[:, :, None], y[:, :, None]), axis=2)\n",
    "    with torch.no_grad():\n",
    "        z = (\n",
    "            net(\n",
    "                torch.cat(\n",
    "                    [\n",
    "                        Tensor(x.reshape(-1, 1)).to(DEVICE),\n",
    "                        Tensor(y.reshape(-1, 1)).to(DEVICE),\n",
    "                    ],\n",
    "                    dim=-1,\n",
    "                )\n",
    "            )\n",
    "            .reshape(x.shape)\n",
    "            .cpu()\n",
    "        )\n",
    "    if ax is None:\n",
    "        ax = plt.gca()\n",
    "    im = ax.pcolormesh(x, y, z, cmap=\"RdBu_r\", shading=\"auto\")\n",
    "    ax.figure.colorbar(im)\n",
    "    ax.set(xlabel=r\"$x$\", ylabel=r\"$y$\", title=r\"Heatmap of $t(z)$ for $z=(x,y)$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3915a5da",
   "metadata": {},
   "source": [
    "To plot the heatmap:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44f37d9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "net.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "704dd52e",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "Why are the values of $t(\\R{Z}_i)$'s and $t(\\R{Z}'_i)$'s concentrated around $0$?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3134ff5b",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "init-nn",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "````{toggle}\n",
    "**Solution** \n",
    "\n",
    "The neural network parameters are all very close to $0$ as we have set a small variance `sigma=0.02` to initialize them randomly. Hence:\n",
    "\n",
    "- The linear transformation $\\M{W}_l (\\cdot) + \\M{b}_l$ is close to $0$ for when the weight and bias are close to $0$. \n",
    "- The ELU activation function $\\sigma$ is also close to $0$ if its input is close to $0$.\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47612c7c",
   "metadata": {},
   "source": [
    "**How to implements the divergence estimate?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f5eb491",
   "metadata": {},
   "source": [
    "We decompose the approximate divergence lower bound in {eq}`avg-DV` as follows:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8810ecbd",
   "metadata": {},
   "source": [
    "$$\n",
    "\\begin{align}\n",
    "\\op{DV}(\\R{Z}^n,\\R{Z'}^{n'},\\theta) &:= \\underbrace{\\frac1{n} \\sum_{i\\in [n]} t(\\R{Z}_i)}_{\\text{(a)}} - \\underbrace{\\log \\frac1{n'} \\sum_{i\\in [n']} e^{t(\\R{Z}'_i)}}_{ \\underbrace{\\log \\sum_{i\\in [n']} e^{t(\\R{Z}'_i)}}_{\\text{(b)}} - \\underbrace{\\log n'}_{\\text{(c)}}} \n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "where $\\theta$ is a tuple of parameters (weights and biases) of the neural network that computes $t$:\n",
    "\n",
    "$$\n",
    "\\theta := (\\M{W}_l,\\M{b}_l|l\\in [3]).\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbaf9cc0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def DV(Z, Z_ref, net):\n",
    "    avg_tZ = net(Z).mean()  # (a)\n",
    "    log_avg_etZ_ref = net(Z_ref).logsumexp(dim=0) - np.log(Z_ref.shape[0])  # (b) - (c)\n",
    "    return avg_tZ - log_avg_etZ_ref\n",
    "\n",
    "\n",
    "DV_estimate = DV(Z, Z_ref, net)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "894eff71",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "Why is it preferrable to use `logsumexp(dim=0)` instead of `.exp().sum().log()`? Try running\n",
    "\n",
    "```Python\n",
    "Tensor([100]).exp().log(), Tensor([100]).logsumexp(0)\n",
    "```\n",
    "\n",
    "in a separate console."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "953a4e23",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "logsumexp",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "````{toggle}\n",
    "**Solution** \n",
    "\n",
    "`logsumexp(dim=0)` is numerically more stable than `.exp().mean().log()` especially when the output of the exponential function is too large to be represented with the default floating point precision.\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab17702",
   "metadata": {},
   "source": [
    "To calculate the gradient of the divergence estimate with respect to $\\theta$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd808bcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "net.zero_grad()  # zero the gradient values of all neural network parameters\n",
    "DV(Z, Z_ref, net).backward()  # calculate the gradient\n",
    "a_param = next(net.parameters())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b5aedc6",
   "metadata": {},
   "source": [
    "`a_param` is a (module) parameter in $\\theta$ retrieved from the parameter iterator `parameters()`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90609830",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "Check that the value of `a_param.grad` is non-zero. Is `a_param` a weight or a bias?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79618990",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "param",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "````{toggle}\n",
    "**Solution** \n",
    "\n",
    "It should be the weight matrix $\\M{W}_1$ because the shape is `torch.Size([100, 2])`.\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dad05aa",
   "metadata": {},
   "source": [
    "**How to gradient descend?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "051d42ff",
   "metadata": {},
   "source": [
    "We will use the [*Adam's* gradient descend algorithm][adam] implemented as an optimizer [`optim.Adam`][optimAdam]:\n",
    "\n",
    "[adam]: https://en.wikipedia.org/wiki/Stochastic_gradient_descent#cite_note-Adam2014-28\n",
    "[optimAdam]: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1441bc66",
   "metadata": {},
   "outputs": [],
   "source": [
    "net = Net().to(DEVICE)\n",
    "optimizer = optim.Adam(\n",
    "    net.parameters(), lr=1e-3\n",
    ")  # Allow Adam's optimizer to update the neural network parameters\n",
    "optimizer.step()  # perform one step of the gradient descent"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32243fd7",
   "metadata": {},
   "source": [
    "To alleviate the problem of overfitting, the gradient is often calculated on randomly chosen batches:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16c0bdf7",
   "metadata": {},
   "source": [
    "$$\n",
    "\\begin{align}\n",
    "\\R{L}(\\theta) := - \\bigg[\\frac1{\\abs{\\R{B}}} \\sum_{i\\in \\R{B}} t(\\R{Z}_i) - \\log \\frac1{\\abs{\\R{B}'}} \\sum_{i\\in \\R{B}'} e^{t(\\R{Z}'_i)} - \\log \\abs{\\R{B}'} \\bigg],\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "which is the negative lower bound of the VD formula in {eq}`DV` but on the minibatches \n",
    "\n",
    "$$\\R{Z}_{\\R{B}}:=(\\R{Z}_i\\mid i\\in \\R{B})\\quad \\text{and}\\quad \\R{Z}'_{\\R{B}'}$$\n",
    "\n",
    "where $\\R{B}$ and $\\R{B}'$ are batches of uniformly randomly chosen indices from $[n]$ and $[n']$ respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e87a0ca3",
   "metadata": {},
   "source": [
    "The neural network parameter is updated\n",
    "\n",
    "$$\n",
    "\\theta_{j+1} := \\theta_j - s_j \\nabla \\R{L}_j(\\theta_j),\n",
    "$$\n",
    "\n",
    "starting with a randomly initialized $\\theta_0$ \n",
    "where $s_j>0$ is the learning rate and $\\R{L}_j$ is the loss evaluated on the $j$-th randomly chosen batches $\\R{B}_j$ and $\\R{B}'_j$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c87b4825",
   "metadata": {},
   "source": [
    "The different batches are often obtained by \n",
    "- permuting the samples first, and then\n",
    "- partitioning the samples into batches.\n",
    "\n",
    "This is illustrated by the figure below:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b20a10fd",
   "metadata": {},
   "source": [
    "![Minibatch gradient descent](images/batch.dio.svg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0c1d1d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "n_iters_per_epoch = 10  # ideally a divisor of both n and n'\n",
    "batch_size = int((Z.shape[0] + 0.5) / n_iters_per_epoch)\n",
    "batch_size_ref = int((Z_ref.shape[0] + 0.5) / n_iters_per_epoch)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a3f2973",
   "metadata": {},
   "source": [
    "We will use `tensorboard` to show the training logs.  \n",
    "Rerun the following to start a new log, for instance, after a change of parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8038ef00",
   "metadata": {},
   "outputs": [],
   "source": [
    "if input(\"New log?[Y/n] \").lower() != \"n\":\n",
    "    n_iter = n_epoch = 0  # keep counts for logging\n",
    "    writer = SummaryWriter()  # create a new folder under runs/ for logging"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cedf32db",
   "metadata": {},
   "source": [
    "The following code carries out Adam's gradient descent on batch loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "379ee580",
   "metadata": {},
   "outputs": [],
   "source": [
    "if input(\"Train? [Y/n]\").lower() != \"n\":\n",
    "    for i in range(10):  # loop through entire data multiple times\n",
    "        n_epoch += 1\n",
    "\n",
    "        # random indices for selecting samples for all batches in one epoch\n",
    "        idx = torch.randperm(Z.shape[0])\n",
    "        idx_ref = torch.randperm(Z_ref.shape[0])\n",
    "\n",
    "        for j in range(n_iters_per_epoch):  # loop through multiple batches\n",
    "            n_iter += 1\n",
    "            optimizer.zero_grad()\n",
    "\n",
    "            # obtain a random batch of samples\n",
    "            batch_Z = Z[idx[j : Z.shape[0] : n_iters_per_epoch]]\n",
    "            batch_Z_ref = Z_ref[idx_ref[j : Z_ref.shape[0] : n_iters_per_epoch]]\n",
    "\n",
    "            # define the loss as negative DV divergence lower bound\n",
    "            loss = -DV(batch_Z, batch_Z_ref, net)\n",
    "            loss.backward()  # calculate gradient\n",
    "            optimizer.step()  # descend\n",
    "\n",
    "        writer.add_scalar(\"Loss/train\", loss.item(), global_step=n_epoch)\n",
    "\n",
    "    # Estimate the divergence using all data\n",
    "    with torch.no_grad():\n",
    "        estimate = DV(Z, Z_ref, net).item()\n",
    "        writer.add_scalar(\"Estimate\", estimate, global_step=n_epoch)\n",
    "        net.plot()\n",
    "        print(\"Divergence estimation:\", estimate)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8ec6d90",
   "metadata": {},
   "source": [
    "Run the following to show the losses and divergence estimate in `tensorboard`. You can rerun the above cell to train the neural network more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58bea532",
   "metadata": {},
   "outputs": [],
   "source": [
    "%tensorboard --logdir=runs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae72f69f",
   "metadata": {},
   "source": [
    "The ground truth is given by\n",
    "\n",
    "$$D(P_{\\R{Z}}\\|P_{\\R{Z}'}) = \\frac12 \\log(1-\\rho^2) $$\n",
    "\n",
    "where $\\rho$ is the randomly generated correlation in the previous notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00cd5bd8",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "Compute the ground truth using the formula above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf133416",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "ground_truth",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": [
     "hide-cell"
    ]
   },
   "outputs": [],
   "source": [
    "### BEGIN SOLUTION\n",
    "ground_truth = -0.5 * np.log(1 - rho ** 2)\n",
    "### END SOLUTION\n",
    "ground_truth"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f092de5d",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "See if you can get an estimate close to this value by training the neural network repeatedly as shown below.\n",
    "\n",
    "![Divergence estimate](images/div_est.dio.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1550c7e1",
   "metadata": {},
   "source": [
    "## Encapsulation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1f66fa4",
   "metadata": {},
   "source": [
    "It is a good idea to encapsulate the training by a class, so multiple configurations can be run without interfering each other:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "816d55ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "class DVTrainer:\n",
    "    \"\"\"\n",
    "    Neural estimator for KL divergence based on the sample DV lower bound.\n",
    "\n",
    "    Estimate D(P_Z||P_Z') using samples Z and Z' by training a network t to maximize\n",
    "        avg(t(Z)) - log avg(e^t(Z'))\n",
    "\n",
    "    Parameters:\n",
    "    ----------\n",
    "\n",
    "    Z, Z_ref : Tensors with first dimension indicing the samples of Z and Z' respect.\n",
    "    net : The neural network t that take Z as input and output a real number for each sample.\n",
    "    n_iters_per_epoch : Number of iterations per epoch.\n",
    "    writer_params : Parameters to be passed to SummaryWriter for logging.\n",
    "    \"\"\"\n",
    "\n",
    "    # constructor\n",
    "    def __init__(self, Z, Z_ref, net, n_iters_per_epoch, writer_params={}, **kwargs):\n",
    "        self.Z = Z\n",
    "        self.Z_ref = Z_ref\n",
    "        self.net = net\n",
    "        self.n_iters_per_epoch = n_iters_per_epoch  # ideally a divisor of both n and n'\n",
    "\n",
    "        # set optimizer\n",
    "        self.optimizer = optim.Adam(net.parameters(), **kwargs)\n",
    "\n",
    "        # logging\n",
    "        self.writer = SummaryWriter(\n",
    "            **writer_params\n",
    "        )  # create a new folder under runs/ for logging\n",
    "        self.n_iter = self.n_epoch = 0  # keep counts for logging\n",
    "\n",
    "    def step(self, epochs=1):\n",
    "        \"\"\"\n",
    "        Carries out the gradient descend for a number of epochs and returns\n",
    "        the divergence estimate evaluated over the entire data.\n",
    "\n",
    "        Loss for each epoch is recorded into the log, but only one divergence\n",
    "        estimate is computed/logged using the entire dataset. Rerun the method,\n",
    "        using a loop, to continue to train the neural network and log the result.\n",
    "\n",
    "        Parameters:\n",
    "        ----------\n",
    "        epochs : number of epochs\n",
    "        \"\"\"\n",
    "        for i in range(epochs):\n",
    "            self.n_epoch += 1\n",
    "\n",
    "            # random indices for selecting samples for all batches in one epoch\n",
    "            idx = torch.randperm(self.Z.shape[0])\n",
    "            idx_ref = torch.randperm(self.Z_ref.shape[0])\n",
    "\n",
    "            for j in range(self.n_iters_per_epoch):\n",
    "                self.n_iter += 1\n",
    "                self.optimizer.zero_grad()\n",
    "\n",
    "                # obtain a random batch of samples\n",
    "                batch_Z = self.Z[idx[i : self.Z.shape[0] : self.n_iters_per_epoch]]\n",
    "                batch_Z_ref = self.Z_ref[\n",
    "                    idx_ref[i : self.Z_ref.shape[0] : self.n_iters_per_epoch]\n",
    "                ]\n",
    "\n",
    "                # define the loss as negative DV divergence lower bound\n",
    "                loss = -DV(batch_Z, batch_Z_ref, self.net)\n",
    "                loss.backward()  # calculate gradient\n",
    "                self.optimizer.step()  # descend\n",
    "\n",
    "                self.writer.add_scalar(\n",
    "                    \"Loss/train\", loss.item(), global_step=self.n_iter\n",
    "                )\n",
    "\n",
    "        with torch.no_grad():\n",
    "            estimate = DV(Z, Z_ref, self.net).item()\n",
    "            self.writer.add_scalar(\"Estimate\", estimate, global_step=self.n_epoch)\n",
    "            return estimate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96f4d34f",
   "metadata": {},
   "source": [
    "To use the above class to train, we first create an instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36ae45a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.manual_seed(SEED)\n",
    "net = Net().to(DEVICE)\n",
    "trainer = DVTrainer(Z, Z_ref, net, n_iters_per_epoch=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "726e9e4d",
   "metadata": {},
   "source": [
    "Next, run `step` iteratively to train the neural network:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fff8e16",
   "metadata": {},
   "outputs": [],
   "source": [
    "if input(\"Train? [Y/n]\").lower() != \"n\":\n",
    "    for i in range(10):\n",
    "        print(\"Divergence estimate:\", trainer.step(10))\n",
    "    net.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "138d17e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%tensorboard --logdir=runs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25fe5a21",
   "metadata": {},
   "source": [
    "## Clean-up"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0991350",
   "metadata": {},
   "source": [
    "It is important to release the resources if it is no longer used. You can release the memory or GPU memory by `Kernel->Shut Down Kernel`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edca15f5",
   "metadata": {},
   "source": [
    "To clear the logs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b57c1de",
   "metadata": {},
   "outputs": [],
   "source": [
    "if input('Delete logs? [y/N]').lower() == 'y':\n",
    "    !rm -rf ./runs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ff66114",
   "metadata": {},
   "source": [
    "To kill a tensorboard instance without shutting down the notebook kernel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d511ffe2",
   "metadata": {},
   "outputs": [],
   "source": [
    "tb.notebook.list() # list all the running TensorBoard notebooks.\n",
    "while (pid := input('pid to kill? (press enter to exit)')):\n",
    "    !kill {pid}"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "text_representation": {
    "extension": ".md",
    "format_name": "myst",
    "format_version": 0.13,
    "jupytext_version": "1.10.3"
   }
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "source_map": [
   14,
   18,
   30,
   58,
   62,
   67,
   73,
   77,
   81,
   85,
   89,
   97,
   103,
   118,
   122,
   126,
   135,
   139,
   148,
   152,
   160,
   164,
   174,
   178,
   185,
   189,
   193,
   201,
   205,
   217,
   223,
   233,
   237,
   241,
   245,
   258,
   282,
   286,
   303,
   307,
   334,
   338,
   340,
   346,
   358,
   362,
   366,
   380,
   388,
   400,
   409,
   413,
   417,
   421,
   427,
   436,
   440,
   447,
   453,
   457,
   471,
   482,
   490,
   494,
   498,
   503,
   507,
   511,
   541,
   545,
   547,
   555,
   561,
   576,
   584,
   588,
   592,
   668,
   672,
   676,
   680,
   687,
   689,
   693,
   697,
   701,
   704,
   708
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}