Training the Neural Network

\(\def\abs#1{\left\lvert #1 \right\rvert} \def\Set#1{\left\{ #1 \right\}} \def\mc#1{\mathcal{#1}} \def\M#1{\boldsymbol{#1}} \def\R#1{\mathsf{#1}} \def\RM#1{\boldsymbol{\mathsf{#1}}} \def\op#1{\operatorname{#1}} \def\E{\op{E}} \def\d{\mathrm{\mathstrut d}}\)

# init
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorboard as tb
import torch
import torch.optim as optim
from torch import Tensor, nn
from torch.nn import functional as F
from torch.utils.tensorboard import SummaryWriter

%load_ext tensorboard
%load_ext jdc
%matplotlib inline

SEED = 0

# create samples
XY_rng = np.random.default_rng(SEED)
rho = 1 - 0.19 * XY_rng.random()
mean, cov, n = [0, 0], [[1, rho], [rho, 1]], 1000
XY = XY_rng.multivariate_normal(mean, cov, n)

XY_ref_rng = np.random.default_rng(SEED)
cov_ref, n_ = [[1, 0], [0, 1]], n
XY_ref = XY_ref_rng.multivariate_normal(mean, cov_ref, n_)

We will train a neural network with torch and use GPU if available:

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

if DEVICE == "cuda":  # print current GPU name if available
    print("Using GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))

When GPU is available, you can use GPU dashboards on the left to monitor GPU utilizations.


How to train a neural network by gradient descent?

We will first consider a simple implementation followed by a more practical implementation.

A simple implementation of gradient descent

Consider solving for a given \(z\in \mathbb{R}\),

\[ \inf_{w\in \mathbb{R}} \overbrace{e^{w\cdot z}}^{L(w):=}.\]

We will train one parameter, namely, \(w\), to minimize the loss \(L(w)\).


What is the solution for \(z=-1\)?


With \(z=-1\),

\[ L(w) = e^{-w} \geq 0 \]

which is achievable with equality as \(w\to \infty\).

How to implement the loss function?

We will define the loss function using tensors:

z = Tensor([-1]).to(DEVICE)  # default tensor type on a designated device

def L(w):
    return (w * z).exp()


The function L is vectorized because Tensor operations follow the broadcasting rules of numpy:

ww = np.linspace(0, 10, 100)
ax = sns.lineplot(
    y=L(Tensor(ww).to(DEVICE)).cpu().numpy(),  # convert to numpy array for plotting
ax.set(xlabel=r"$w$", title=r"$L(w)=e^{-w}$")
ax.axhline(L(float("inf")), ls="--", c="r")

What is gradient descent?

A gradient descent algorithm updates the parameter \(w\) iteratively starting with some initial \(w_0\):

\[w_{i+1} = w_i - s_i \nabla L(w_i) \qquad \text{for }i\geq 0,\]

where \(s_i\) is the learning rate (step size).

How to compute the gradient?

With \(w=0\),

\[\nabla L(0) = \left.-e^{-w}\right|_{w=0}=-1,\]

which can be computed using backward (backpropagation):

w = Tensor([0]).to(DEVICE).requires_grad_()  # requires gradient calculation for w
L(w).backward()  # calculate the gradient by backpropagation

Under the hood, the function call L(w)

  • not only return the loss function evaluated at w, but also

  • updates a computational graph for calculating the gradient since w requires_grad_().

How to implement the gradient descent?

With a learning rate of 0.001:

for i in range(1000):
    w.grad = None  # zero the gradient to avoid accumulation
    with torch.no_grad():  # updates the weights in place without gradient calculation
        w -= w.grad * 1e-3

print("w:", w.item(), "\nL(w):", L(w).item())

What is torch.no_grad()?

It sets up a context where the computational graph will not be updated. In particular,

w -= w.grad * 1e-3

should not be differentiated in the subsequent calculations of the gradient.


Repeatedly run the above cell until you get L(w) below 0.001. How large is the value of w? What is the limitations of the simple gradient descent algorithm?


The value of w needs to be smaller than 6.9. The convergence can be slow, especially when the learning rate is small. Also, w can be far away from its optimal value even if L(w) is close to its minimum.

A practical implementation

For a neural network to approximate a sophisticated function, it should have many parameters (degrees of freedom).

How to define a neural network?

The following code defines a simple neural network with 3 fully-connected (fc) hidden layers:

Neural net


  • \(\M{W}_l\) and \(\M{b}_l\) are the weight and bias respectively for the linear transformation \(\M{W}_l \M{a}_l + \M{b}_l\) of the \(l\)-th layer; and

  • \(\sigma\) for the first 2 hidden layers is an activation function called the exponential linear unit (ELU).

class Net(nn.Module):
    def __init__(self, input_size=2, hidden_size=100, sigma=0.02):
        self.fc1 = nn.Linear(input_size, hidden_size)  # fully-connected (fc) layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)  # layer 2
        self.fc3 = nn.Linear(hidden_size, 1)  # layer 3
        nn.init.normal_(self.fc1.weight, std=sigma)  #
        nn.init.constant_(self.fc1.bias, 0)
        nn.init.normal_(self.fc2.weight, std=sigma)
        nn.init.constant_(self.fc2.bias, 0)
        nn.init.normal_(self.fc3.weight, std=sigma)
        nn.init.constant_(self.fc3.bias, 0)

    def forward(self, z):
        a1 = F.elu(self.fc1(z))
        a2 = F.elu(self.fc2(a1))
        t = self.fc3(a2)
        return t

torch.manual_seed(SEED)  # seed RNG for PyTorch
net = Net().to(DEVICE)

The neural network is also a vectorized function. E.g., the following call net once to plots the density estimate of all \(t(\R{Z}_i)\)’s and \(t(\R{Z}'_i)\)’s.

Z = Tensor(XY).to(DEVICE)
Z_ref = Tensor(XY_ref).to(DEVICE)

tZ = (
    net(, Z_ref), dim=0))  # compute t(Z_i)'s and t(Z'_i)
    # output needs to be converted back to an array on CPU for plotting
    .cpu()  # copy back to CPU
    .detach()  # detach from current graph (no gradient calculation)
    .numpy()  # convert output back to numpy

tZ_df = pd.DataFrame(data=tZ, columns=["t"])
sns.kdeplot(data=tZ_df, x="t")

For 2D sample \((x,y)\in \mc{Z}\), we can plot the neural network \(t(x,y)\) as a heatmap. The following code adds a method plot to Net using jdc:

%%add_to Net
def plot(net, xmin=-5, xmax=5, ymin=-5, ymax=5, xgrids=50, ygrids=50, ax=None):
    """Plot a heat map of a neural network net. net can only have two inputs."""
    x, y = np.mgrid[xmin : xmax : xgrids * 1j, ymin : ymax : ygrids * 1j]
    xy = np.concatenate((x[:, :, None], y[:, :, None]), axis=2)
    with torch.no_grad():
        z = (
                        Tensor(x.reshape(-1, 1)).to(DEVICE),
                        Tensor(y.reshape(-1, 1)).to(DEVICE),
    if ax is None:
        ax = plt.gca()
    im = ax.pcolormesh(x, y, z, cmap="RdBu_r", shading="auto")
    ax.set(xlabel=r"$x$", ylabel=r"$y$", title=r"Heatmap of $t(z)$ for $z=(x,y)$")

To plot the heatmap:



Why are the values of \(t(\R{Z}_i)\)’s and \(t(\R{Z}'_i)\)’s concentrated around \(0\)?


The neural network parameters are all very close to \(0\) as we have set a small variance sigma=0.02 to initialize them randomly. Hence:

  • The linear transformation \(\M{W}_l (\cdot) + \M{b}_l\) is close to \(0\) for when the weight and bias are close to \(0\).

  • The ELU activation function \(\sigma\) is also close to \(0\) if its input is close to \(0\).

How to implements the divergence estimate?

We decompose the approximate divergence lower bound in (14) as follows:

\[ \begin{align} \op{DV}(\R{Z}^n,\R{Z'}^{n'},\theta) &:= \underbrace{\frac1{n} \sum_{i\in [n]} t(\R{Z}_i)}_{\text{(a)}} - \underbrace{\log \frac1{n'} \sum_{i\in [n']} e^{t(\R{Z}'_i)}}_{ \underbrace{\log \sum_{i\in [n']} e^{t(\R{Z}'_i)}}_{\text{(b)}} - \underbrace{\log n'}_{\text{(c)}}} \end{align} \]

where \(\theta\) is a tuple of parameters (weights and biases) of the neural network that computes \(t\):

\[ \theta := (\M{W}_l,\M{b}_l|l\in [3]). \]
def DV(Z, Z_ref, net):
    avg_tZ = net(Z).mean()  # (a)
    log_avg_etZ_ref = net(Z_ref).logsumexp(dim=0) - np.log(Z_ref.shape[0])  # (b) - (c)
    return avg_tZ - log_avg_etZ_ref

DV_estimate = DV(Z, Z_ref, net)


Why is it preferrable to use logsumexp(dim=0) instead of .exp().sum().log()? Try running

Tensor([100]).exp().log(), Tensor([100]).logsumexp(0)

in a separate console.


logsumexp(dim=0) is numerically more stable than .exp().mean().log() especially when the output of the exponential function is too large to be represented with the default floating point precision.

To calculate the gradient of the divergence estimate with respect to \(\theta\):

net.zero_grad()  # zero the gradient values of all neural network parameters
DV(Z, Z_ref, net).backward()  # calculate the gradient
a_param = next(net.parameters())

a_param is a (module) parameter in \(\theta\) retrieved from the parameter iterator parameters().


Check that the value of a_param.grad is non-zero. Is a_param a weight or a bias?


It should be the weight matrix \(\M{W}_1\) because the shape is torch.Size([100, 2]).

How to gradient descend?

We will use the Adam’s gradient descend algorithm implemented as an optimizer optim.Adam:

net = Net().to(DEVICE)
optimizer = optim.Adam(
    net.parameters(), lr=1e-3
)  # Allow Adam's optimizer to update the neural network parameters
optimizer.step()  # perform one step of the gradient descent

To alleviate the problem of overfitting, the gradient is often calculated on randomly chosen batches:

\[ \begin{align} \R{L}(\theta) := - \bigg[\frac1{\abs{\R{B}}} \sum_{i\in \R{B}} t(\R{Z}_i) - \log \frac1{\abs{\R{B}'}} \sum_{i\in \R{B}'} e^{t(\R{Z}'_i)} - \log \abs{\R{B}'} \bigg], \end{align} \]

which is the negative lower bound of the VD formula in (12) but on the minibatches

\[\R{Z}_{\R{B}}:=(\R{Z}_i\mid i\in \R{B})\quad \text{and}\quad \R{Z}'_{\R{B}'}\]

where \(\R{B}\) and \(\R{B}'\) are batches of uniformly randomly chosen indices from \([n]\) and \([n']\) respectively.

The neural network parameter is updated

\[ \theta_{j+1} := \theta_j - s_j \nabla \R{L}_j(\theta_j), \]

starting with a randomly initialized \(\theta_0\) where \(s_j>0\) is the learning rate and \(\R{L}_j\) is the loss evaluated on the \(j\)-th randomly chosen batches \(\R{B}_j\) and \(\R{B}'_j\).

The different batches are often obtained by

  • permuting the samples first, and then

  • partitioning the samples into batches.

This is illustrated by the figure below:

Minibatch gradient descent

n_iters_per_epoch = 10  # ideally a divisor of both n and n'
batch_size = int((Z.shape[0] + 0.5) / n_iters_per_epoch)
batch_size_ref = int((Z_ref.shape[0] + 0.5) / n_iters_per_epoch)

We will use tensorboard to show the training logs.
Rerun the following to start a new log, for instance, after a change of parameters.

if input("New log?[Y/n] ").lower() != "n":
    n_iter = n_epoch = 0  # keep counts for logging
    writer = SummaryWriter()  # create a new folder under runs/ for logging

The following code carries out Adam’s gradient descent on batch loss:

if input("Train? [Y/n]").lower() != "n":
    for i in range(10):  # loop through entire data multiple times
        n_epoch += 1

        # random indices for selecting samples for all batches in one epoch
        idx = torch.randperm(Z.shape[0])
        idx_ref = torch.randperm(Z_ref.shape[0])

        for j in range(n_iters_per_epoch):  # loop through multiple batches
            n_iter += 1

            # obtain a random batch of samples
            batch_Z = Z[idx[j : Z.shape[0] : n_iters_per_epoch]]
            batch_Z_ref = Z_ref[idx_ref[j : Z_ref.shape[0] : n_iters_per_epoch]]

            # define the loss as negative DV divergence lower bound
            loss = -DV(batch_Z, batch_Z_ref, net)
            loss.backward()  # calculate gradient
            optimizer.step()  # descend

        writer.add_scalar("Loss/train", loss.item(), global_step=n_epoch)

    # Estimate the divergence using all data
    with torch.no_grad():
        estimate = DV(Z, Z_ref, net).item()
        writer.add_scalar("Estimate", estimate, global_step=n_epoch)
        print("Divergence estimation:", estimate)

Run the following to show the losses and divergence estimate in tensorboard. You can rerun the above cell to train the neural network more.

%tensorboard --logdir=runs

The ground truth is given by

\[D(P_{\R{Z}}\|P_{\R{Z}'}) = \frac12 \log(1-\rho^2) \]

where \(\rho\) is the randomly generated correlation in the previous notebook.


Compute the ground truth using the formula above.

ground_truth = -0.5 * np.log(1 - rho ** 2)


See if you can get an estimate close to this value by training the neural network repeatedly as shown below.

Divergence estimate


It is a good idea to encapsulate the training by a class, so multiple configurations can be run without interfering each other:

class DVTrainer:
    Neural estimator for KL divergence based on the sample DV lower bound.

    Estimate D(P_Z||P_Z') using samples Z and Z' by training a network t to maximize
        avg(t(Z)) - log avg(e^t(Z'))


    Z, Z_ref : Tensors with first dimension indicing the samples of Z and Z' respect.
    net : The neural network t that take Z as input and output a real number for each sample.
    n_iters_per_epoch : Number of iterations per epoch.
    writer_params : Parameters to be passed to SummaryWriter for logging.

    # constructor
    def __init__(self, Z, Z_ref, net, n_iters_per_epoch, writer_params={}, **kwargs):
        self.Z = Z
        self.Z_ref = Z_ref = net
        self.n_iters_per_epoch = n_iters_per_epoch  # ideally a divisor of both n and n'

        # set optimizer
        self.optimizer = optim.Adam(net.parameters(), **kwargs)

        # logging
        self.writer = SummaryWriter(
        )  # create a new folder under runs/ for logging
        self.n_iter = self.n_epoch = 0  # keep counts for logging

    def step(self, epochs=1):
        Carries out the gradient descend for a number of epochs and returns
        the divergence estimate evaluated over the entire data.

        Loss for each epoch is recorded into the log, but only one divergence
        estimate is computed/logged using the entire dataset. Rerun the method,
        using a loop, to continue to train the neural network and log the result.

        epochs : number of epochs
        for i in range(epochs):
            self.n_epoch += 1

            # random indices for selecting samples for all batches in one epoch
            idx = torch.randperm(self.Z.shape[0])
            idx_ref = torch.randperm(self.Z_ref.shape[0])

            for j in range(self.n_iters_per_epoch):
                self.n_iter += 1

                # obtain a random batch of samples
                batch_Z = self.Z[idx[i : self.Z.shape[0] : self.n_iters_per_epoch]]
                batch_Z_ref = self.Z_ref[
                    idx_ref[i : self.Z_ref.shape[0] : self.n_iters_per_epoch]

                # define the loss as negative DV divergence lower bound
                loss = -DV(batch_Z, batch_Z_ref,
                loss.backward()  # calculate gradient
                self.optimizer.step()  # descend

                    "Loss/train", loss.item(), global_step=self.n_iter

        with torch.no_grad():
            estimate = DV(Z, Z_ref,
            self.writer.add_scalar("Estimate", estimate, global_step=self.n_epoch)
            return estimate

To use the above class to train, we first create an instance:

net = Net().to(DEVICE)
trainer = DVTrainer(Z, Z_ref, net, n_iters_per_epoch=10)

Next, run step iteratively to train the neural network:

if input("Train? [Y/n]").lower() != "n":
    for i in range(10):
        print("Divergence estimate:", trainer.step(10))
%tensorboard --logdir=runs


It is important to release the resources if it is no longer used. You can release the memory or GPU memory by Kernel->Shut Down Kernel.

To clear the logs:

if input('Delete logs? [y/N]').lower() == 'y':
    !rm -rf ./runs

To kill a tensorboard instance without shutting down the notebook kernel:

tb.notebook.list() # list all the running TensorBoard notebooks.
while (pid := input('pid to kill? (press enter to exit)')):
    !kill {pid}