# Training the Neural Network¶

\(\def\abs#1{\left\lvert #1 \right\rvert} \def\Set#1{\left\{ #1 \right\}} \def\mc#1{\mathcal{#1}} \def\M#1{\boldsymbol{#1}} \def\R#1{\mathsf{#1}} \def\RM#1{\boldsymbol{\mathsf{#1}}} \def\op#1{\operatorname{#1}} \def\E{\op{E}} \def\d{\mathrm{\mathstrut d}}\)

```
# init
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorboard as tb
import torch
import torch.optim as optim
from torch import Tensor, nn
from torch.nn import functional as F
from torch.utils.tensorboard import SummaryWriter
%load_ext tensorboard
%load_ext jdc
%matplotlib inline
SEED = 0
# create samples
XY_rng = np.random.default_rng(SEED)
rho = 1 - 0.19 * XY_rng.random()
mean, cov, n = [0, 0], [[1, rho], [rho, 1]], 1000
XY = XY_rng.multivariate_normal(mean, cov, n)
XY_ref_rng = np.random.default_rng(SEED)
cov_ref, n_ = [[1, 0], [0, 1]], n
XY_ref = XY_ref_rng.multivariate_normal(mean, cov_ref, n_)
```

We will train a neural network with `torch`

and use GPU if available:

```
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
if DEVICE == "cuda": # print current GPU name if available
print("Using GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
```

When GPU is available, you can use GPU dashboards on the left to monitor GPU utilizations.

**How to train a neural network by gradient descent?**

We will first consider a simple implementation followed by a more practical implementation.

## A simple implementation of gradient descent¶

Consider solving for a given \(z\in \mathbb{R}\),

We will train one parameter, namely, \(w\), to minimize the loss \(L(w)\).

**Exercise**

What is the solution for \(z=-1\)?

**Solution**

With \(z=-1\),

which is achievable with equality as \(w\to \infty\).

**How to implement the loss function?**

We will define the loss function using tensors:

```
z = Tensor([-1]).to(DEVICE) # default tensor type on a designated device
def L(w):
return (w * z).exp()
L(float("inf"))
```

The function `L`

is vectorized because `Tensor`

operations follow the broadcasting rules of `numpy`

:

```
ww = np.linspace(0, 10, 100)
ax = sns.lineplot(
x=ww,
y=L(Tensor(ww).to(DEVICE)).cpu().numpy(), # convert to numpy array for plotting
)
ax.set(xlabel=r"$w$", title=r"$L(w)=e^{-w}$")
ax.axhline(L(float("inf")), ls="--", c="r")
plt.show()
```

**What is gradient descent?**

A gradient descent algorithm updates the parameter \(w\) iteratively starting with some initial \(w_0\):

where \(s_i\) is the *learning rate* (*step size*).

**How to compute the gradient?**

With \(w=0\),

which can be computed using `backward`

(backpropagation):

```
w = Tensor([0]).to(DEVICE).requires_grad_() # requires gradient calculation for w
L(w).backward() # calculate the gradient by backpropagation
w.grad
```

Under the hood, the function call `L(w)`

not only return the loss function evaluated at

`w`

, but alsoupdates a computational graph for calculating the gradient since

`w`

`requires_grad_()`

.

**How to implement the gradient descent?**

With a learning rate of `0.001`

:

```
for i in range(1000):
w.grad = None # zero the gradient to avoid accumulation
L(w).backward()
with torch.no_grad(): # updates the weights in place without gradient calculation
w -= w.grad * 1e-3
print("w:", w.item(), "\nL(w):", L(w).item())
```

**What is torch.no_grad()?**

It sets up a context where the computational graph will not be updated. In particular,

```
w -= w.grad * 1e-3
```

should not be differentiated in the subsequent calculations of the gradient.

**Exercise**

Repeatedly run the above cell until you get `L(w)`

below `0.001`

. How large is the value of `w`

? What is the limitations of the simple gradient descent algorithm?

**Solution**

The value of `w`

needs to be smaller than `6.9`

. The convergence can be slow, especially when the learning rate is small. Also, `w`

can be far away from its optimal value even if `L(w)`

is close to its minimum.

## A practical implementation¶

For a neural network to approximate a sophisticated function, it should have many parameters (*degrees of freedom*).

**How to define a neural network?**

The following code defines a simple neural network with 3 fully-connected (fc) hidden layers:

where

\(\M{W}_l\) and \(\M{b}_l\) are the weight and bias respectively for the linear transformation \(\M{W}_l \M{a}_l + \M{b}_l\) of the \(l\)-th layer; and

\(\sigma\) for the first 2 hidden layers is an activation function called the

*exponential linear unit (ELU)*.

```
class Net(nn.Module):
def __init__(self, input_size=2, hidden_size=100, sigma=0.02):
super().__init__()
self.fc1 = nn.Linear(input_size, hidden_size) # fully-connected (fc) layer
self.fc2 = nn.Linear(hidden_size, hidden_size) # layer 2
self.fc3 = nn.Linear(hidden_size, 1) # layer 3
nn.init.normal_(self.fc1.weight, std=sigma) #
nn.init.constant_(self.fc1.bias, 0)
nn.init.normal_(self.fc2.weight, std=sigma)
nn.init.constant_(self.fc2.bias, 0)
nn.init.normal_(self.fc3.weight, std=sigma)
nn.init.constant_(self.fc3.bias, 0)
def forward(self, z):
a1 = F.elu(self.fc1(z))
a2 = F.elu(self.fc2(a1))
t = self.fc3(a2)
return t
torch.manual_seed(SEED) # seed RNG for PyTorch
net = Net().to(DEVICE)
print(net)
```

The neural network is also a vectorized function. E.g., the following call `net`

once to plots the density estimate of all \(t(\R{Z}_i)\)’s and \(t(\R{Z}'_i)\)’s.

```
Z = Tensor(XY).to(DEVICE)
Z_ref = Tensor(XY_ref).to(DEVICE)
tZ = (
net(torch.cat((Z, Z_ref), dim=0)) # compute t(Z_i)'s and t(Z'_i)
# output needs to be converted back to an array on CPU for plotting
.cpu() # copy back to CPU
.detach() # detach from current graph (no gradient calculation)
.numpy() # convert output back to numpy
)
tZ_df = pd.DataFrame(data=tZ, columns=["t"])
sns.kdeplot(data=tZ_df, x="t")
plt.show()
```

For 2D sample \((x,y)\in \mc{Z}\), we can plot the neural network \(t(x,y)\) as a heatmap. The following code adds a method `plot`

to `Net`

using `jdc`

:

```
%%add_to Net
def plot(net, xmin=-5, xmax=5, ymin=-5, ymax=5, xgrids=50, ygrids=50, ax=None):
"""Plot a heat map of a neural network net. net can only have two inputs."""
x, y = np.mgrid[xmin : xmax : xgrids * 1j, ymin : ymax : ygrids * 1j]
xy = np.concatenate((x[:, :, None], y[:, :, None]), axis=2)
with torch.no_grad():
z = (
net(
torch.cat(
[
Tensor(x.reshape(-1, 1)).to(DEVICE),
Tensor(y.reshape(-1, 1)).to(DEVICE),
],
dim=-1,
)
)
.reshape(x.shape)
.cpu()
)
if ax is None:
ax = plt.gca()
im = ax.pcolormesh(x, y, z, cmap="RdBu_r", shading="auto")
ax.figure.colorbar(im)
ax.set(xlabel=r"$x$", ylabel=r"$y$", title=r"Heatmap of $t(z)$ for $z=(x,y)$")
```

To plot the heatmap:

```
net.plot()
```

**Exercise**

Why are the values of \(t(\R{Z}_i)\)’s and \(t(\R{Z}'_i)\)’s concentrated around \(0\)?

**Solution**

The neural network parameters are all very close to \(0\) as we have set a small variance `sigma=0.02`

to initialize them randomly. Hence:

The linear transformation \(\M{W}_l (\cdot) + \M{b}_l\) is close to \(0\) for when the weight and bias are close to \(0\).

The ELU activation function \(\sigma\) is also close to \(0\) if its input is close to \(0\).

**How to implements the divergence estimate?**

We decompose the approximate divergence lower bound in (14) as follows:

where \(\theta\) is a tuple of parameters (weights and biases) of the neural network that computes \(t\):

```
def DV(Z, Z_ref, net):
avg_tZ = net(Z).mean() # (a)
log_avg_etZ_ref = net(Z_ref).logsumexp(dim=0) - np.log(Z_ref.shape[0]) # (b) - (c)
return avg_tZ - log_avg_etZ_ref
DV_estimate = DV(Z, Z_ref, net)
```

**Exercise**

Why is it preferrable to use `logsumexp(dim=0)`

instead of `.exp().sum().log()`

? Try running

```
Tensor([100]).exp().log(), Tensor([100]).logsumexp(0)
```

in a separate console.

**Solution**

`logsumexp(dim=0)`

is numerically more stable than `.exp().mean().log()`

especially when the output of the exponential function is too large to be represented with the default floating point precision.

To calculate the gradient of the divergence estimate with respect to \(\theta\):

```
net.zero_grad() # zero the gradient values of all neural network parameters
DV(Z, Z_ref, net).backward() # calculate the gradient
a_param = next(net.parameters())
```

`a_param`

is a (module) parameter in \(\theta\) retrieved from the parameter iterator `parameters()`

.

**Exercise**

Check that the value of `a_param.grad`

is non-zero. Is `a_param`

a weight or a bias?

**Solution**

It should be the weight matrix \(\M{W}_1\) because the shape is `torch.Size([100, 2])`

.

**How to gradient descend?**

We will use the *Adam’s* gradient descend algorithm implemented as an optimizer `optim.Adam`

:

```
net = Net().to(DEVICE)
optimizer = optim.Adam(
net.parameters(), lr=1e-3
) # Allow Adam's optimizer to update the neural network parameters
optimizer.step() # perform one step of the gradient descent
```

To alleviate the problem of overfitting, the gradient is often calculated on randomly chosen batches:

which is the negative lower bound of the VD formula in (12) but on the minibatches

where \(\R{B}\) and \(\R{B}'\) are batches of uniformly randomly chosen indices from \([n]\) and \([n']\) respectively.

The neural network parameter is updated

starting with a randomly initialized \(\theta_0\) where \(s_j>0\) is the learning rate and \(\R{L}_j\) is the loss evaluated on the \(j\)-th randomly chosen batches \(\R{B}_j\) and \(\R{B}'_j\).

The different batches are often obtained by

permuting the samples first, and then

partitioning the samples into batches.

This is illustrated by the figure below:

```
n_iters_per_epoch = 10 # ideally a divisor of both n and n'
batch_size = int((Z.shape[0] + 0.5) / n_iters_per_epoch)
batch_size_ref = int((Z_ref.shape[0] + 0.5) / n_iters_per_epoch)
```

We will use `tensorboard`

to show the training logs.

Rerun the following to start a new log, for instance, after a change of parameters.

```
if input("New log?[Y/n] ").lower() != "n":
n_iter = n_epoch = 0 # keep counts for logging
writer = SummaryWriter() # create a new folder under runs/ for logging
```

The following code carries out Adam’s gradient descent on batch loss:

```
if input("Train? [Y/n]").lower() != "n":
for i in range(10): # loop through entire data multiple times
n_epoch += 1
# random indices for selecting samples for all batches in one epoch
idx = torch.randperm(Z.shape[0])
idx_ref = torch.randperm(Z_ref.shape[0])
for j in range(n_iters_per_epoch): # loop through multiple batches
n_iter += 1
optimizer.zero_grad()
# obtain a random batch of samples
batch_Z = Z[idx[j : Z.shape[0] : n_iters_per_epoch]]
batch_Z_ref = Z_ref[idx_ref[j : Z_ref.shape[0] : n_iters_per_epoch]]
# define the loss as negative DV divergence lower bound
loss = -DV(batch_Z, batch_Z_ref, net)
loss.backward() # calculate gradient
optimizer.step() # descend
writer.add_scalar("Loss/train", loss.item(), global_step=n_epoch)
# Estimate the divergence using all data
with torch.no_grad():
estimate = DV(Z, Z_ref, net).item()
writer.add_scalar("Estimate", estimate, global_step=n_epoch)
net.plot()
print("Divergence estimation:", estimate)
```

Run the following to show the losses and divergence estimate in `tensorboard`

. You can rerun the above cell to train the neural network more.

```
%tensorboard --logdir=runs
```

The ground truth is given by

where \(\rho\) is the randomly generated correlation in the previous notebook.

**Exercise**

Compute the ground truth using the formula above.

```
### BEGIN SOLUTION
ground_truth = -0.5 * np.log(1 - rho ** 2)
### END SOLUTION
ground_truth
```

**Exercise**

See if you can get an estimate close to this value by training the neural network repeatedly as shown below.

## Encapsulation¶

It is a good idea to encapsulate the training by a class, so multiple configurations can be run without interfering each other:

```
class DVTrainer:
"""
Neural estimator for KL divergence based on the sample DV lower bound.
Estimate D(P_Z||P_Z') using samples Z and Z' by training a network t to maximize
avg(t(Z)) - log avg(e^t(Z'))
Parameters:
----------
Z, Z_ref : Tensors with first dimension indicing the samples of Z and Z' respect.
net : The neural network t that take Z as input and output a real number for each sample.
n_iters_per_epoch : Number of iterations per epoch.
writer_params : Parameters to be passed to SummaryWriter for logging.
"""
# constructor
def __init__(self, Z, Z_ref, net, n_iters_per_epoch, writer_params={}, **kwargs):
self.Z = Z
self.Z_ref = Z_ref
self.net = net
self.n_iters_per_epoch = n_iters_per_epoch # ideally a divisor of both n and n'
# set optimizer
self.optimizer = optim.Adam(net.parameters(), **kwargs)
# logging
self.writer = SummaryWriter(
**writer_params
) # create a new folder under runs/ for logging
self.n_iter = self.n_epoch = 0 # keep counts for logging
def step(self, epochs=1):
"""
Carries out the gradient descend for a number of epochs and returns
the divergence estimate evaluated over the entire data.
Loss for each epoch is recorded into the log, but only one divergence
estimate is computed/logged using the entire dataset. Rerun the method,
using a loop, to continue to train the neural network and log the result.
Parameters:
----------
epochs : number of epochs
"""
for i in range(epochs):
self.n_epoch += 1
# random indices for selecting samples for all batches in one epoch
idx = torch.randperm(self.Z.shape[0])
idx_ref = torch.randperm(self.Z_ref.shape[0])
for j in range(self.n_iters_per_epoch):
self.n_iter += 1
self.optimizer.zero_grad()
# obtain a random batch of samples
batch_Z = self.Z[idx[i : self.Z.shape[0] : self.n_iters_per_epoch]]
batch_Z_ref = self.Z_ref[
idx_ref[i : self.Z_ref.shape[0] : self.n_iters_per_epoch]
]
# define the loss as negative DV divergence lower bound
loss = -DV(batch_Z, batch_Z_ref, self.net)
loss.backward() # calculate gradient
self.optimizer.step() # descend
self.writer.add_scalar(
"Loss/train", loss.item(), global_step=self.n_iter
)
with torch.no_grad():
estimate = DV(Z, Z_ref, self.net).item()
self.writer.add_scalar("Estimate", estimate, global_step=self.n_epoch)
return estimate
```

To use the above class to train, we first create an instance:

```
torch.manual_seed(SEED)
net = Net().to(DEVICE)
trainer = DVTrainer(Z, Z_ref, net, n_iters_per_epoch=10)
```

Next, run `step`

iteratively to train the neural network:

```
if input("Train? [Y/n]").lower() != "n":
for i in range(10):
print("Divergence estimate:", trainer.step(10))
net.plot()
```

```
%tensorboard --logdir=runs
```

## Clean-up¶

It is important to release the resources if it is no longer used. You can release the memory or GPU memory by `Kernel->Shut Down Kernel`

.

To clear the logs:

```
if input('Delete logs? [y/N]').lower() == 'y':
!rm -rf ./runs
```

To kill a tensorboard instance without shutting down the notebook kernel:

```
tb.notebook.list() # list all the running TensorBoard notebooks.
while (pid := input('pid to kill? (press enter to exit)')):
!kill {pid}
```