from util import *
Optimization Theory¶
Why can we learn from examples?
For the examples to be useful, the label $Y$ must be random, and its distribution must be unknown. Why?
- If $Y$ were deterministic, i.e., $Y = y$ all the time for some fixed label $y$, then the classifier could simply return $y$ without even looking at $\mathbf{X}$.
- If the distribution $p_{Y|\mathbf{X}}$ were known instead, then the optimal classifier would also be known and therefore would not need to be estimated.
More precisely, the examples are called i.i.d. (independent and identically distributed) samples of $(\mathbf{X}, Y)$, written as
$(\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_n, Y_n) \overset{\text{i.i.d.}}{\sim} p_{\mathbf{X},Y},$
which means that their joint distribution is $\prod_{i=1}^n p_{\mathbf{X},Y}(\mathbf{x}_i, y_i)$.
Why?
- If all the examples were the same instead, they could not show the pattern of how $Y$ depends on $\mathbf{X}$.
- Noise in individual examples can be smoothed out by averaging over many examples.
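As a hypothetical illustration of both points (all names and the 10% noise level below are assumptions, not from these notes), the following sketch draws i.i.d. samples where the label usually follows the feature, and shows that averaging over many examples recovers the underlying dependence despite the noise:

```python
# Hypothetical sketch: i.i.d. samples (X_i, Y_i) with 10% label noise.
# Averaging over many examples smooths the noise out, revealing how Y
# depends on X.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
X = rng.integers(0, 2, size=n)        # i.i.d. binary features X_i
noise = rng.random(n) < 0.1           # each label is flipped w.p. 0.1
Y = np.where(noise, 1 - X, X)         # Y usually equals X

# Empirical estimate of P(Y = 1 | X = 1); converges to the true value 0.9.
p_hat = Y[X == 1].mean()
```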
How to determine if a classifier is good?
Ultimately, we desire a classifier with the maximum accuracy in predicting $Y$ from $\mathbf{X}$, but achieving that maximum directly is computationally too difficult.
Instead, we regard a classification algorithm as reasonably good if it is consistent, i.e., if
- it can achieve the maximum possible accuracy
- as the number of training samples goes to ∞.
A consistent probabilistic classifier gives rise to an asymptotically optimal hard-decision classifier that achieves the maximum accuracy.
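A toy simulation can make this concrete. The sketch below is an assumed setup (not the classifier from these notes): a simple plug-in classifier estimates the most likely label for each input from the training data, and its test accuracy approaches the maximum possible (Bayes) accuracy of 0.8 as the number of training samples grows.

```python
# Assumed toy problem: X uniform over {0, 1, 2}; Y = X with probability 0.8,
# otherwise one of the other two labels. The Bayes rule predicts Y = X and
# achieves the maximum possible accuracy of 0.8.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    X = rng.integers(0, 3, size=n)
    flip = rng.random(n) < 0.2
    Y = np.where(flip, (X + rng.integers(1, 3, size=n)) % 3, X)
    return X, Y

def accuracy(n_train, n_test=20_000):
    Xtr, Ytr = sample(n_train)
    # Plug-in rule: predict the most frequent label seen for each x.
    rule = np.array([np.bincount(Ytr[Xtr == x], minlength=3).argmax()
                     for x in range(3)])
    Xte, Yte = sample(n_test)
    return (rule[Xte] == Yte).mean()

accs = [accuracy(n) for n in (10, 100, 10_000)]  # approaches 0.8
```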
How can we obtain a consistent classifier?
We train a neural network to minimize a certain loss function. A common loss function for classification is the cross entropy from information theory.
The identity can be proved quite easily using the linearity of expectation
and the property of the logarithm that $\log \frac{a}{b} = \log a - \log b$ for all $a, b > 0$.
Hence, a neural network that minimizes the cross entropy satisfies $q_{Y|\mathbf{X}}(y|\mathbf{x}) = p_{Y|\mathbf{X}}(y|\mathbf{x})$ a.s. for all labels $y$ and any possible input image $\mathbf{x}$.
Solution to Exercise 1
Proof: Applying the non-negativity of divergence, $D(p_{Y|\mathbf{X}} \| q_{Y|\mathbf{X}}) \geq 0$, to the information identity, we have
$H(p_{Y|\mathbf{X}}, q_{Y|\mathbf{X}}) = H(Y|\mathbf{X}) + D(p_{Y|\mathbf{X}} \| q_{Y|\mathbf{X}}) \geq H(Y|\mathbf{X}),$
with equality if and only if $q_{Y|\mathbf{X}} = p_{Y|\mathbf{X}}$ a.s. Hence, the cross entropy is minimized to the conditional entropy $H(Y|\mathbf{X})$ by having $q_{Y|\mathbf{X}} = p_{Y|\mathbf{X}}$ a.s.
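The identity and its consequence can also be checked numerically. The sketch below uses assumed toy distributions p and q for a single input (base-e logarithms):

```python
# Numeric check (assumed toy distributions) of the information identity
# H(p, q) = H(p) + D(p || q), which implies H(p, q) >= H(p) with equality
# iff q = p, since the divergence D(p || q) is non-negative.
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # true conditional distribution p(y|x)
q = np.array([0.5, 0.3, 0.2])    # model's estimate q(y|x)

H_pq = -(p * np.log(q)).sum()    # cross entropy H(p, q)
H_p = -(p * np.log(p)).sum()     # entropy H(p)
D = (p * np.log(p / q)).sum()    # divergence D(p || q) >= 0
```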
The cross entropy cannot be computed exactly without knowing the joint distribution $p_{\mathbf{X},Y}$. Nevertheless, it can be estimated from a batch of i.i.d. samples $(\mathbf{X}_i, Y_i)$ for $i \in B$:
$\hat{H}(\theta) := -\frac{1}{|B|} \sum_{i \in B} \log q_{Y|\mathbf{X}}(Y_i|\mathbf{X}_i; \theta),$
where $\theta$ is the vector of parameters of the neural network defined in (net).
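Concretely, the estimate is the batch average of the negative log-probability the network assigns to each true label, which is what Keras' SparseCategoricalCrossentropy computes from predicted probabilities. A small sketch with assumed toy predictions:

```python
# Assumed toy batch: predicted distributions q(.|x_i) over 10 digit types
# for 3 examples. Each row sums to 1 (9 * 0.05 + 0.55 = 1).
import numpy as np

q = np.full((3, 10), 0.05)
q[0, 7] = q[1, 2] = q[2, 1] = 0.55
labels = np.array([7, 2, 1])               # true labels y_i

# Empirical cross entropy: average of -log q(y_i | x_i) over the batch.
loss = -np.log(q[np.arange(len(labels)), labels]).mean()
```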
A mini-batch gradient descent algorithm is often used to reduce the loss. It iteratively updates/trains the neural network parameters
$\theta \leftarrow \theta - \alpha \nabla_\theta \hat{H}(\theta)$
by computing the gradient $\nabla_\theta \hat{H}(\theta)$ on a randomly selected mini-batch $B$ of examples and choosing an appropriate learning rate $\alpha$.
What is gradient descent?
How to choose the step size?
- The gradient can be computed systematically using a technique called backpropagation, thanks to the layered structure of the neural network in (net).
- The learning rate $\alpha$ can affect the rate at which the loss converges to a local minimum:
  - the parameters $\theta$ may overshoot their optimal values if $\alpha$ is too large, and
  - the convergence can be very slow if $\alpha$ is too small.
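Both failure modes can be seen on an assumed toy objective (not the network's loss). Running gradient descent on $L(t) = t^2$, whose gradient is $2t$ and whose minimizer is 0:

```python
# Assumed toy objective L(t) = t^2 with gradient 2t, minimized at t = 0.
def descend(lr, steps=50, t=1.0):
    for _ in range(steps):
        t = t - lr * 2 * t      # gradient descent update
    return t

slow = descend(0.01)    # too small: still far from 0 after 50 steps
good = descend(0.1)     # converges quickly towards 0
bad = descend(1.1)      # too large: overshoots and diverges
```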
A more advanced method called Adam (Adaptive Moment Estimation) can adaptively choose the step size to speed up the convergence.
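As a rough sketch of how Adam works (the update formulas below are the standard ones; the toy loss and all variable names are assumptions), it keeps moving averages of the gradient and its square, and scales each step accordingly:

```python
# Minimal sketch of the Adam update rule, applied to an assumed toy
# loss theta^2 with gradient 2 * theta.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad         # moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * theta                     # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
# theta ends up near the minimizer 0
```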
Training¶
The loss function, gradient descent algorithm, and the performance metrics can be specified using the compile method.
def compile_model(model):
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.Adam(0.001),
                  metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
    return model
compile_model(model)
model.loss, model.optimizer
We can train the neural network using the method fit of the compiled model:
if input('Train? [Y/n]').lower() != 'n':
    model.fit(ds_b["train"])
Solution to Exercise 2
The accuracy increases at a diminishing rate as we rerun the training: each rerun continues the gradient descent, which iteratively reduces the loss, but the improvement shrinks as the parameters approach a local minimum.
We can set the parameter epochs to train the neural network for multiple epochs, since it is quite unlikely that a single epoch trains a neural network well.
To determine whether the neural network is well-trained (and when to stop training), we should also use a separate validation set to evaluate the performance of the neural network. The validation set can be specified using the parameter validation_data as follows:
if input('Train? [Y/n]').lower() != 'n':
    model.fit(ds_b["train"], epochs=6, validation_data=ds_b["test"])
Solution to Exercise 3
It is biased since the selection of the model depends on the validation accuracy, and therefore, the validation set. To avoid such bias, we should use a separate test set to evaluate the performance of the well-trained neural network at the end.
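A minimal sketch of such a three-way split (the sizes and names are assumptions, not from these notes): the validation set guides training and model selection, while the test set is touched only once at the very end.

```python
# Assumed example: split 100 example indices into train/validation/test.
import numpy as np

rng = np.random.default_rng(2)
idx = rng.permutation(100)
train, val, test = idx[:70], idx[70:85], idx[85:]   # 70% / 15% / 15%
```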
Deployment¶
Once you are satisfied with the result, you can deploy the model as a web application.
The mnist folder contains the webpage index.html that
- presents an HTML5 canvas for users to input a handwritten digit,
- loads a trained model using tensorflow.js,
- passes the handwritten digit to the model to predict the distribution of the digit types, and
- displays the most likely digit type.
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<div style="text-align: center;">
<canvas id="sketchpad" style="border-style:solid;"></canvas>
<br>
<button onclick="sketchpad.undo()">undo</button>
<button onclick="sketchpad.redo()">redo</button>
<button onclick="sketchpad.clear()">clear</button>
<button onclick="predict()">predict</button><br>
<input oninput="sketchpad.penSize=Number(this.value)" id="size-picker" type="range" min="1" max="50">
<br>
<div style="display: flex; justify-content: center;">
<canvas id="input" width="28" height="28"></canvas>
<p id="result"></p>
</div>
</div>
<script src="https://cdn.jsdelivr.net/npm/jquery@1.11.1/dist/jquery.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/sketchpad@0.1.0/scripts/sketchpad.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
<script>
const context = document.querySelector("#input").getContext('2d');
const sketchpad = new Sketchpad({
element: '#sketchpad',
width: 280,
height: 280
});
sketchpad.penSize = 25;
$('#size-picker').val(sketchpad.penSize);
$('#size-picker').change(function (event) {sketchpad.penSize = $(event.target).val()});
let model;
addEventListener('DOMContentLoaded', (async function () {
model = await tf.loadLayersModel('model/model.json');
}));
function predict() {
var img = new Image();
img.onload = async function() {
context.clearRect(0, 0, 28, 28);
context.drawImage(img, 0, 0, 28, 28);
const data = context.getImageData(0, 0, 28, 28).data;
var input = [];
for(var i = 0; i < data.length; i += 4) {
input.push(data[i + 3] / 255);
}
let scores = await model.predict(tf.tensor(input).reshape([1, 28, 28, 1])).array();
scores = scores[0];
$('#result').text('is classified as ' + scores.indexOf(Math.max(...scores)) + '.');
};
img.src = sketchpad.canvas.toDataURL('image/png');
}
</script>
</body>
</html>
Then, convert the model to files that can be loaded by tensorflow.js:
import tensorflowjs as tfjs
tfjs.converters.save_keras_model(model, "mnist/model")
To host the web application, run the following command:
if input('Execute? [Y/n]').lower() != 'n':
    !mkdir -p ~/www/ && cp -r mnist ~/www/
View your web app here:
display.IFrame(src=JUPYTER_SERVICE_PREFIX + 'www/mnist/', width=500, height=400)