--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.10.3 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- +++ {"slideshow": {"slide_type": "slide"}} # Training the Neural Network +++ $\def\abs#1{\left\lvert #1 \right\rvert} \def\Set#1{\left\{ #1 \right\}} \def\mc#1{\mathcal{#1}} \def\M#1{\boldsymbol{#1}} \def\R#1{\mathsf{#1}} \def\RM#1{\boldsymbol{\mathsf{#1}}} \def\op#1{\operatorname{#1}} \def\E{\op{E}} \def\d{\mathrm{\mathstrut d}}$ ```{code-cell} ipython3 # init import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns import tensorboard as tb import torch import torch.optim as optim from torch import Tensor, nn from torch.nn import functional as F from torch.utils.tensorboard import SummaryWriter %load_ext tensorboard %load_ext jdc %matplotlib inline SEED = 0 # create samples XY_rng = np.random.default_rng(SEED) rho = 1 - 0.19 * XY_rng.random() mean, cov, n = [0, 0], [[1, rho], [rho, 1]], 1000 XY = XY_rng.multivariate_normal(mean, cov, n) XY_ref_rng = np.random.default_rng(SEED) cov_ref, n_ = [[1, 0], [0, 1]], n XY_ref = XY_ref_rng.multivariate_normal(mean, cov_ref, n_) ``` We will train a neural network with `torch` and use GPU if available: ```{code-cell} ipython3 DEVICE = "cuda" if torch.cuda.is_available() else "cpu" if DEVICE == "cuda": # print current GPU name if available print("Using GPU:", torch.cuda.get_device_name(torch.cuda.current_device())) ``` When GPU is available, you can use [GPU dashboards][gpu] on the left to monitor GPU utilizations. [gpu]: https://github.com/rapidsai/jupyterlab-nvdashboard +++ ![GPU](images/gpu.dio.svg) +++ **How to train a neural network by gradient descent?** +++ We will first consider a simple implementation followed by a more practical implementation. +++ ## A simple implementation of gradient descent +++ Consider solving for a given $z\in \mathbb{R}$, $$ \inf_{w\in \mathbb{R}} \overbrace{e^{w\cdot z}}^{L(w):=}.$$ We will train one parameter, namely, $w$, to minimize the loss $L(w)$. +++ **Exercise** What is the solution for $z=-1$? +++ {"nbgrader": {"grade": true, "grade_id": "eg-min", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false}} ````{toggle} **Solution** With $z=-1$, $$ L(w) = e^{-w} \geq 0 $$ which is achievable with equality as $w\to \infty$. ```` +++ **How to implement the loss function?** +++ We will define the loss function using tensors: ```{code-cell} ipython3 z = Tensor([-1]).to(DEVICE) # default tensor type on a designated device def L(w): return (w * z).exp() L(float("inf")) ``` The function `L` is vectorized because `Tensor` operations follow the [broadcasting rules of `numpy`](https://numpy.org/doc/stable/user/basics.broadcasting.html): ```{code-cell} ipython3 ww = np.linspace(0, 10, 100) ax = sns.lineplot( x=ww, y=L(Tensor(ww).to(DEVICE)).cpu().numpy(), # convert to numpy array for plotting ) ax.set(xlabel=r"$w$", title=r"$L(w)=e^{-w}$") ax.axhline(L(float("inf")), ls="--", c="r") plt.show() ``` **What is gradient descent?** +++ A gradient descent algorithm updates the parameter $w$ iteratively starting with some initial $w_0$: $$w_{i+1} = w_i - s_i \nabla L(w_i) \qquad \text{for }i\geq 0,$$ where $s_i$ is the *learning rate* (*step size*). +++ **How to compute the gradient?** +++ With $w=0$, $$\nabla L(0) = \left.-e^{-w}\right|_{w=0}=-1,$$ which can be computed using `backward` ([backpropagation][bp]): [bp]: https://en.wikipedia.org/wiki/Backpropagation ```{code-cell} ipython3 w = Tensor([0]).to(DEVICE).requires_grad_() # requires gradient calculation for w L(w).backward() # calculate the gradient by backpropagation w.grad ``` Under the hood, the function call `L(w)` - not only return the loss function evaluated at `w`, but also - updates a computational graph for calculating the gradient since `w` `requires_grad_()`. +++ **How to implement the gradient descent?** +++ With a learning rate of `0.001`: ```{code-cell} ipython3 for i in range(1000): w.grad = None # zero the gradient to avoid accumulation L(w).backward() with torch.no_grad(): # updates the weights in place without gradient calculation w -= w.grad * 1e-3 print("w:", w.item(), "\nL(w):", L(w).item()) ``` **What is `torch.no_grad()`?** +++ It sets up a context where the computational graph will not be updated. In particular, ```Python w -= w.grad * 1e-3 ``` should not be differentiated in the subsequent calculations of the gradient. [no_grad]: https://pytorch.org/docs/stable/generated/torch.no_grad.html +++ **Exercise** Repeatedly run the above cell until you get `L(w)` below `0.001`. How large is the value of `w`? What is the limitations of the simple gradient descent algorithm? +++ {"nbgrader": {"grade": true, "grade_id": "gd-limitations", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false}} ````{toggle} **Solution** The value of `w` needs to be smaller than `6.9`. The convergence can be slow, especially when the learning rate is small. Also, `w` can be far away from its optimal value even if `L(w)` is close to its minimum. ```` +++ ## A practical implementation +++ For a neural network to approximate a sophisticated function, it should have many parameters (*degrees of freedom*). +++ **How to define a neural network?** +++ The following code [defines a simple neural network][define] with 3 fully-connected (fc) hidden layers: ![Neural net](images/nn.dio.svg) where - $\M{W}_l$ and $\M{b}_l$ are the weight and bias respectively for the linear transformation $\M{W}_l \M{a}_l + \M{b}_l$ of the $l$-th layer; and - $\sigma$ for the first 2 hidden layers is an activation function called the [*exponential linear unit (ELU)*](https://pytorch.org/docs/stable/generated/torch.nn.ELU.html). [define]: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#define-the-network ```{code-cell} ipython3 class Net(nn.Module): def __init__(self, input_size=2, hidden_size=100, sigma=0.02): super().__init__() self.fc1 = nn.Linear(input_size, hidden_size) # fully-connected (fc) layer self.fc2 = nn.Linear(hidden_size, hidden_size) # layer 2 self.fc3 = nn.Linear(hidden_size, 1) # layer 3 nn.init.normal_(self.fc1.weight, std=sigma) # nn.init.constant_(self.fc1.bias, 0) nn.init.normal_(self.fc2.weight, std=sigma) nn.init.constant_(self.fc2.bias, 0) nn.init.normal_(self.fc3.weight, std=sigma) nn.init.constant_(self.fc3.bias, 0) def forward(self, z): a1 = F.elu(self.fc1(z)) a2 = F.elu(self.fc2(a1)) t = self.fc3(a2) return t torch.manual_seed(SEED) # seed RNG for PyTorch net = Net().to(DEVICE) print(net) ``` The neural network is also a vectorized function. E.g., the following call `net` once to plots the density estimate of all $t(\R{Z}_i)$'s and $t(\R{Z}'_i)$'s. ```{code-cell} ipython3 :tags: [] Z = Tensor(XY).to(DEVICE) Z_ref = Tensor(XY_ref).to(DEVICE) tZ = ( net(torch.cat((Z, Z_ref), dim=0)) # compute t(Z_i)'s and t(Z'_i) # output needs to be converted back to an array on CPU for plotting .cpu() # copy back to CPU .detach() # detach from current graph (no gradient calculation) .numpy() # convert output back to numpy ) tZ_df = pd.DataFrame(data=tZ, columns=["t"]) sns.kdeplot(data=tZ_df, x="t") plt.show() ``` For 2D sample $(x,y)\in \mc{Z}$, we can plot the neural network $t(x,y)$ as a heatmap. The following code adds a method `plot` to `Net` using [`jdc`](https://alexhagen.github.io/jdc): ```{code-cell} ipython3 :tags: [] %%add_to Net def plot(net, xmin=-5, xmax=5, ymin=-5, ymax=5, xgrids=50, ygrids=50, ax=None): """Plot a heat map of a neural network net. net can only have two inputs.""" x, y = np.mgrid[xmin : xmax : xgrids * 1j, ymin : ymax : ygrids * 1j] xy = np.concatenate((x[:, :, None], y[:, :, None]), axis=2) with torch.no_grad(): z = ( net( torch.cat( [ Tensor(x.reshape(-1, 1)).to(DEVICE), Tensor(y.reshape(-1, 1)).to(DEVICE), ], dim=-1, ) ) .reshape(x.shape) .cpu() ) if ax is None: ax = plt.gca() im = ax.pcolormesh(x, y, z, cmap="RdBu_r", shading="auto") ax.figure.colorbar(im) ax.set(xlabel=r"$x$", ylabel=r"$y$", title=r"Heatmap of $t(z)$ for $z=(x,y)$") ``` To plot the heatmap: ```{code-cell} ipython3 net.plot() ``` **Exercise** Why are the values of $t(\R{Z}_i)$'s and $t(\R{Z}'_i)$'s concentrated around $0$? +++ {"nbgrader": {"grade": true, "grade_id": "init-nn", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false}} ````{toggle} **Solution** The neural network parameters are all very close to $0$ as we have set a small variance `sigma=0.02` to initialize them randomly. Hence: - The linear transformation $\M{W}_l (\cdot) + \M{b}_l$ is close to $0$ for when the weight and bias are close to $0$. - The ELU activation function $\sigma$ is also close to $0$ if its input is close to $0$. ```` +++ **How to implements the divergence estimate?** +++ We decompose the approximate divergence lower bound in {eq}`avg-DV` as follows: +++ $$ \begin{align} \op{DV}(\R{Z}^n,\R{Z'}^{n'},\theta) &:= \underbrace{\frac1{n} \sum_{i\in [n]} t(\R{Z}_i)}_{\text{(a)}} - \underbrace{\log \frac1{n'} \sum_{i\in [n']} e^{t(\R{Z}'_i)}}_{ \underbrace{\log \sum_{i\in [n']} e^{t(\R{Z}'_i)}}_{\text{(b)}} - \underbrace{\log n'}_{\text{(c)}}} \end{align} $$ where $\theta$ is a tuple of parameters (weights and biases) of the neural network that computes $t$: $$ \theta := (\M{W}_l,\M{b}_l|l\in [3]). $$ ```{code-cell} ipython3 def DV(Z, Z_ref, net): avg_tZ = net(Z).mean() # (a) log_avg_etZ_ref = net(Z_ref).logsumexp(dim=0) - np.log(Z_ref.shape[0]) # (b) - (c) return avg_tZ - log_avg_etZ_ref DV_estimate = DV(Z, Z_ref, net) ``` **Exercise** Why is it preferrable to use `logsumexp(dim=0)` instead of `.exp().sum().log()`? Try running ```Python Tensor([100]).exp().log(), Tensor([100]).logsumexp(0) ``` in a separate console. +++ {"nbgrader": {"grade": true, "grade_id": "logsumexp", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false}} ````{toggle} **Solution** `logsumexp(dim=0)` is numerically more stable than `.exp().mean().log()` especially when the output of the exponential function is too large to be represented with the default floating point precision. ```` +++ To calculate the gradient of the divergence estimate with respect to $\theta$: ```{code-cell} ipython3 net.zero_grad() # zero the gradient values of all neural network parameters DV(Z, Z_ref, net).backward() # calculate the gradient a_param = next(net.parameters()) ``` `a_param` is a (module) parameter in $\theta$ retrieved from the parameter iterator `parameters()`. +++ **Exercise** Check that the value of `a_param.grad` is non-zero. Is `a_param` a weight or a bias? +++ {"nbgrader": {"grade": true, "grade_id": "param", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false}} ````{toggle} **Solution** It should be the weight matrix $\M{W}_1$ because the shape is `torch.Size([100, 2])`. ```` +++ **How to gradient descend?** +++ We will use the [*Adam's* gradient descend algorithm][adam] implemented as an optimizer [`optim.Adam`][optimAdam]: [adam]: https://en.wikipedia.org/wiki/Stochastic_gradient_descent#cite_note-Adam2014-28 [optimAdam]: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam ```{code-cell} ipython3 net = Net().to(DEVICE) optimizer = optim.Adam( net.parameters(), lr=1e-3 ) # Allow Adam's optimizer to update the neural network parameters optimizer.step() # perform one step of the gradient descent ``` To alleviate the problem of overfitting, the gradient is often calculated on randomly chosen batches: +++ $$ \begin{align} \R{L}(\theta) := - \bigg[\frac1{\abs{\R{B}}} \sum_{i\in \R{B}} t(\R{Z}_i) - \log \frac1{\abs{\R{B}'}} \sum_{i\in \R{B}'} e^{t(\R{Z}'_i)} - \log \abs{\R{B}'} \bigg], \end{align} $$ which is the negative lower bound of the VD formula in {eq}`DV` but on the minibatches $$\R{Z}_{\R{B}}:=(\R{Z}_i\mid i\in \R{B})\quad \text{and}\quad \R{Z}'_{\R{B}'}$$ where $\R{B}$ and $\R{B}'$ are batches of uniformly randomly chosen indices from $[n]$ and $[n']$ respectively. +++ The neural network parameter is updated $$ \theta_{j+1} := \theta_j - s_j \nabla \R{L}_j(\theta_j), $$ starting with a randomly initialized $\theta_0$ where $s_j>0$ is the learning rate and $\R{L}_j$ is the loss evaluated on the $j$-th randomly chosen batches $\R{B}_j$ and $\R{B}'_j$. +++ The different batches are often obtained by - permuting the samples first, and then - partitioning the samples into batches. This is illustrated by the figure below: +++ ![Minibatch gradient descent](images/batch.dio.svg) ```{code-cell} ipython3 n_iters_per_epoch = 10 # ideally a divisor of both n and n' batch_size = int((Z.shape[0] + 0.5) / n_iters_per_epoch) batch_size_ref = int((Z_ref.shape[0] + 0.5) / n_iters_per_epoch) ``` We will use `tensorboard` to show the training logs. Rerun the following to start a new log, for instance, after a change of parameters. ```{code-cell} ipython3 if input("New log?[Y/n] ").lower() != "n": n_iter = n_epoch = 0 # keep counts for logging writer = SummaryWriter() # create a new folder under runs/ for logging ``` The following code carries out Adam's gradient descent on batch loss: ```{code-cell} ipython3 if input("Train? [Y/n]").lower() != "n": for i in range(10): # loop through entire data multiple times n_epoch += 1 # random indices for selecting samples for all batches in one epoch idx = torch.randperm(Z.shape[0]) idx_ref = torch.randperm(Z_ref.shape[0]) for j in range(n_iters_per_epoch): # loop through multiple batches n_iter += 1 optimizer.zero_grad() # obtain a random batch of samples batch_Z = Z[idx[j : Z.shape[0] : n_iters_per_epoch]] batch_Z_ref = Z_ref[idx_ref[j : Z_ref.shape[0] : n_iters_per_epoch]] # define the loss as negative DV divergence lower bound loss = -DV(batch_Z, batch_Z_ref, net) loss.backward() # calculate gradient optimizer.step() # descend writer.add_scalar("Loss/train", loss.item(), global_step=n_epoch) # Estimate the divergence using all data with torch.no_grad(): estimate = DV(Z, Z_ref, net).item() writer.add_scalar("Estimate", estimate, global_step=n_epoch) net.plot() print("Divergence estimation:", estimate) ``` Run the following to show the losses and divergence estimate in `tensorboard`. You can rerun the above cell to train the neural network more. ```{code-cell} ipython3 %tensorboard --logdir=runs ``` The ground truth is given by $$D(P_{\R{Z}}\|P_{\R{Z}'}) = \frac12 \log(1-\rho^2) $$ where $\rho$ is the randomly generated correlation in the previous notebook. +++ **Exercise** Compute the ground truth using the formula above. ```{code-cell} ipython3 --- nbgrader: grade: false grade_id: ground_truth locked: false schema_version: 3 solution: true task: false tags: [hide-cell] --- ### BEGIN SOLUTION ground_truth = -0.5 * np.log(1 - rho ** 2) ### END SOLUTION ground_truth ``` **Exercise** See if you can get an estimate close to this value by training the neural network repeatedly as shown below. ![Divergence estimate](images/div_est.dio.svg) +++ ## Encapsulation +++ It is a good idea to encapsulate the training by a class, so multiple configurations can be run without interfering each other: ```{code-cell} ipython3 class DVTrainer: """ Neural estimator for KL divergence based on the sample DV lower bound. Estimate D(P_Z||P_Z') using samples Z and Z' by training a network t to maximize avg(t(Z)) - log avg(e^t(Z')) Parameters: ---------- Z, Z_ref : Tensors with first dimension indicing the samples of Z and Z' respect. net : The neural network t that take Z as input and output a real number for each sample. n_iters_per_epoch : Number of iterations per epoch. writer_params : Parameters to be passed to SummaryWriter for logging. """ # constructor def __init__(self, Z, Z_ref, net, n_iters_per_epoch, writer_params={}, **kwargs): self.Z = Z self.Z_ref = Z_ref self.net = net self.n_iters_per_epoch = n_iters_per_epoch # ideally a divisor of both n and n' # set optimizer self.optimizer = optim.Adam(net.parameters(), **kwargs) # logging self.writer = SummaryWriter( **writer_params ) # create a new folder under runs/ for logging self.n_iter = self.n_epoch = 0 # keep counts for logging def step(self, epochs=1): """ Carries out the gradient descend for a number of epochs and returns the divergence estimate evaluated over the entire data. Loss for each epoch is recorded into the log, but only one divergence estimate is computed/logged using the entire dataset. Rerun the method, using a loop, to continue to train the neural network and log the result. Parameters: ---------- epochs : number of epochs """ for i in range(epochs): self.n_epoch += 1 # random indices for selecting samples for all batches in one epoch idx = torch.randperm(self.Z.shape[0]) idx_ref = torch.randperm(self.Z_ref.shape[0]) for j in range(self.n_iters_per_epoch): self.n_iter += 1 self.optimizer.zero_grad() # obtain a random batch of samples batch_Z = self.Z[idx[i : self.Z.shape[0] : self.n_iters_per_epoch]] batch_Z_ref = self.Z_ref[ idx_ref[i : self.Z_ref.shape[0] : self.n_iters_per_epoch] ] # define the loss as negative DV divergence lower bound loss = -DV(batch_Z, batch_Z_ref, self.net) loss.backward() # calculate gradient self.optimizer.step() # descend self.writer.add_scalar( "Loss/train", loss.item(), global_step=self.n_iter ) with torch.no_grad(): estimate = DV(Z, Z_ref, self.net).item() self.writer.add_scalar("Estimate", estimate, global_step=self.n_epoch) return estimate ``` To use the above class to train, we first create an instance: ```{code-cell} ipython3 torch.manual_seed(SEED) net = Net().to(DEVICE) trainer = DVTrainer(Z, Z_ref, net, n_iters_per_epoch=10) ``` Next, run `step` iteratively to train the neural network: ```{code-cell} ipython3 if input("Train? [Y/n]").lower() != "n": for i in range(10): print("Divergence estimate:", trainer.step(10)) net.plot() ``` ```{code-cell} ipython3 %tensorboard --logdir=runs ``` ## Clean-up +++ It is important to release the resources if it is no longer used. You can release the memory or GPU memory by `Kernel->Shut Down Kernel`. +++ To clear the logs: ```{code-cell} ipython3 if input('Delete logs? [y/N]').lower() == 'y': !rm -rf ./runs ``` To kill a tensorboard instance without shutting down the notebook kernel: ```{code-cell} ipython3 tb.notebook.list() # list all the running TensorBoard notebooks. while (pid := input('pid to kill? (press enter to exit)')): !kill {pid} ```