
Network Architecture

City University of Hong Kong
%matplotlib widget
from util import *

Mathematical Definition

As shown below, a neural network is organized into layers of computation units called neurons.

For $\ell\in \{0,\dots,L\}$ and integer $L\geq 1$, let

  • $\M{a}^{(\ell)}$ be the output of the $\ell$-th layer of the neural network, and
  • $a^{(\ell)}_i$ be the $i$-th element of $\M{a}^{(\ell)}$. Except for $\ell=0$, the element is computed from the output $\M{a}^{(\ell-1)}$ of the previous layer.

The 0-th layer is called the input layer, i.e.,

$$\M{a}^{(0)} := \M{x}.$$

The $L$-th layer $\M{a}^{(L)}$ is called the output layer. All other layers are called the hidden layers.

What should be the neural network output?

The goal is to train a classifier that predicts a label $\R{y}$ for an input feature $\RM{x}$:

  • Instead of training a hard-decision classifier, i.e., a function $f:\mc{X}\to \mc{Y}$ such that $f(\RM{x})$ predicts $\R{y}$,
  • we train a probabilistic classifier $q_{\R{y}|\RM{x}}$ that estimates $p_{\R{y}|\RM{x}}$, i.e.,

$$\begin{align} [q_{\R{y}|\RM{x}}(y|\M{x})]_{y\in \mc{Y}} &:= \M{a}^{(L)}. \end{align}$$

For the MNIST dataset, a common goal is to classify the digit type of a handwritten digit. When given a handwritten digit,

  • a hard-decision classifier returns a digit type, and
  • a probabilistic classifier returns a distribution over the digit types.

Why train a probabilistic classifier?

  • A probabilistic classifier is more general, as it can also give a hard decision

    $$f(\RM{x}):=\arg\max_{y\in \mc{Y}} q_{\R{y}|\RM{x}}(y|\RM{x})$$

    by returning the estimated most likely digit type (see the sketch after this list).

  • A neural network can model the distribution $p_{\R{y}|\RM{x}}(\cdot|\RM{x})$ better than the discrete label $\R{y}$ because the network output is continuous.
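
The following is a minimal sketch (not part of the original notebook) of how a hard decision is obtained from a probabilistic classifier's output; the probability vector below is purely hypothetical:

import numpy as np

# Hypothetical output of a probabilistic classifier: a distribution over the digit types 0-9.
q = np.array([0.02, 0.01, 0.05, 0.6, 0.02, 0.1, 0.05, 0.05, 0.05, 0.05])

# Hard decision: return the estimated most likely digit type.
hard_decision = np.argmax(q)
print(hard_decision)  # 3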

How to ensure a(L)\M{a}^{(L)} is a valid probability vector?

The soft-max activation function is often used for the last layer:

$$\begin{align} \sigma^{(L)}\left(\left[\begin{smallmatrix}z^{(\ell)}_1 \\ \vdots \\ z^{(\ell)}_k\end{smallmatrix}\right]\right) := \frac{1}{\sum_{i=1}^k e^{z^{(\ell)}_i}} \left[\begin{smallmatrix}e^{z^{(\ell)}_1} \\ \vdots \\ e^{z^{(\ell)}_k}\end{smallmatrix}\right]\tag{soft-max} \end{align}$$

where $k:=\abs{\mc{Y}}=10$ is the number of distinct class labels.

It follows that:

$$\sum_{i=1}^k a_i^{(L)} = 1\kern1em \text{and} \kern1em a_i^{(L)}\geq 0\qquad \forall i\in \{1,\dots,k\}.$$
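
As a sanity check (a minimal numpy sketch, not part of the original notebook), the soft-max can be computed directly; subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

import numpy as np


def softmax(z):
    # Subtract the maximum for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()


z = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0, 0.0, -0.5, 2.5, 1.5])  # k = 10 scores
a_L = softmax(z)
print(a_L.sum())         # 1.0: the entries sum to one
print((a_L >= 0).all())  # True: the entries are non-negative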

How are the different layers related?

$$\begin{align} \M{a}^{(\ell)}&:=\begin{cases} \M{x} & \ell=0\\ \sigma^{(\ell)}(\overbrace{\M{W}^{(\ell)}\M{a}^{(\ell-1)}+\M{b}^{(\ell)}}^{\RM{z}^{(\ell)}:=})& \ell>0; \end{cases}\tag{net} \end{align}$$

where

  • $\M{W}^{(\ell)}$ is a matrix of weights;
  • $\M{b}^{(\ell)}$ is a vector called the bias; and
  • $\sigma^{(\ell)}$ is a real-valued function called the activation function (see the forward-pass sketch after this list).
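
To make (net) concrete, here is a minimal numpy sketch of the forward computation for a single layer; the weights, bias, and input values below are made up purely for illustration, with ReLU as the activation:

import numpy as np

a_prev = np.array([0.5, -1.0, 2.0])   # previous-layer output a^(l-1) with 3 units

W = np.array([[0.1, -0.2, 0.3],
              [0.4, 0.0, -0.1]])      # weight matrix W^(l) for a layer with 2 units
b = np.array([0.01, -0.02])           # bias vector b^(l)

z = W @ a_prev + b                    # pre-activation z^(l) := W^(l) a^(l-1) + b^(l)
a = np.maximum(0, z)                  # a^(l) := sigma^(l)(z^(l)) with ReLU activation
print(z, a)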

The activation function $\sigma^{(\ell)}$ for the other layers $1\leq \ell<L$ is often the vectorized version of one of the following:

  • sigmoid:
    $$\sigma_{\text{sigmoid}}(z) = \frac{1}{1+e^{-z}}$$
  • rectified linear unit (ReLU):
    $$\sigma_{\text{ReLU}}(z) = \max\{0,z\}.$$

The following plots the ReLU activation function.

def ReLU(z):
    return np.max([np.zeros(z.shape), z], axis=0)


z = np.linspace(-5, 5, 100)
plt.figure(num=1)
plt.plot(z, ReLU(z))
plt.xlim(-5, 5)
plt.title(r"ReLU: $\max\{0,z\}$")
plt.xlabel(r"$z$")
plt.show()
def sigmoid(z):
    ### BEGIN SOLUTION
    return 1 / (1 + np.exp(-z))
    ### END SOLUTION


z = np.linspace(-5, 5, 100)
plt.figure(num=2)
plt.plot(z, sigmoid(z))
plt.xlim(-5, 5)
plt.title(r"Sigmoid function: $\frac{1}{1+e^{-z}}$")
plt.xlabel(r"$z$")
plt.show()
# tests
### BEGIN HIDDEN TESTS
z_test = np.linspace(-5, 5, 10)
assert np.isclose(sigmoid(z_test), (lambda z: 1 / (1 + np.exp(-z)))(z_test)).all()
### END HIDDEN TESTS

Implementation

The following uses the keras library to define the basic neural network architecture.

keras runs on top of tensorflow and offers a higher-level abstraction to simplify the construction and training of a neural network. (tflearn is another library that provides a higher-level API for tensorflow.)

def create_simple_model():
    tf.keras.backend.clear_session()  # clear keras cache.
    # See https://github.com/keras-team/keras/issues/7294
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(16, activation=tf.keras.activations.relu),
            tf.keras.layers.Dense(16, activation=tf.keras.activations.relu),
            tf.keras.layers.Dense(10, activation=tf.keras.activations.softmax),
        ],
        "Simple_sequential",
    )
    return model


model = create_simple_model()
model.summary()

The above defines a linear stack of fully-connected layers after flattening the input. The method summary is useful for debugging in Keras.
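
As a quick check of the parameter counts reported by summary (assuming the 784-16-16-10 stack defined in create_simple_model above), each dense layer has (number of inputs × number of units) weights plus one bias per unit:

# Dense layer parameters = weights + biases for the 784 -> 16 -> 16 -> 10 stack above.
params_dense1 = 28 * 28 * 16 + 16  # 12560
params_dense2 = 16 * 16 + 16       # 272
params_dense3 = 16 * 10 + 10       # 170
print(params_dense1 + params_dense2 + params_dense3)  # 13002 trainable parameters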

### BEGIN SOLUTION
n_hidden_layers = len(model.layers) - 2
### END SOLUTION
n_hidden_layers
# tests
### BEGIN HIDDEN TESTS
assert n_hidden_layers == len(model.layers) - 2
### END HIDDEN TESTS