
Network Architecture

City University of Hong Kong
%matplotlib widget
from util import *

Mathematical Definition

As shown below, a neural network is organized into layers of computation units called neurons.

For $\ell\in \{0,\dots,L\}$ and integer $L\geq 1$, let

  • $\M{a}^{(\ell)}$ be the output of the $\ell$-th layer of the neural network, and
  • $a^{(\ell)}_i$ be the $i$-th element of $\M{a}^{(\ell)}$. Except for $\ell=0$, the element is computed from the output $\M{a}^{(\ell-1)}$ of the previous layer.

The 0-th layer is called the input layer, i.e.,

$$\M{a}^{(0)} := \M{x}.$$

The $L$-th layer $\M{a}^{(L)}$ is called the output layer. All other layers are called the hidden layers.

What should be the neural network output?

The goal is to train a classifier that predicts a label $\R{y}$ for an input feature $\RM{x}$:

  • Instead of training a hard-decision classifier, i.e., a function $f:\mc{X}\to \mc{Y}$ such that $f(\RM{x})$ predicts $\R{y}$,
  • we train a probabilistic classifier $q_{\R{y}|\RM{x}}$ that estimates $p_{\R{y}|\RM{x}}$, i.e.,

$$\begin{align} [q_{\R{y}|\RM{x}}(y|\M{x})]_{y\in \mc{Y}} &:= \M{a}^{(L)}. \end{align}$$

For the MNIST dataset, a common goal is to classify the digit type of a handwritten digit. When given a handwritten digit,

  • a hard-decision classifier returns a digit type, and
  • a probabilistic classifier returns a distribution over the digit types.

Why train a probabilistic classifier?

  • A probabilistic classifier is more general, as it can also give a hard decision

    $$f(\RM{x}):=\arg\max_{y\in \mc{Y}} q_{\R{y}|\RM{x}}(y|\RM{x})$$

    by returning the estimated most likely digit type (see the sketch after this list).

  • A neural network can model the distribution $p_{\R{y}|\RM{x}}(\cdot|\RM{x})$ better than the discrete label $\R{y}$ because the network output is continuous.
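
The following is a minimal sketch (not part of the original notebook) of how a hard decision is obtained from a probabilistic classifier's output; the probability vector below is purely hypothetical:

import numpy as np

# Hypothetical output of a probabilistic classifier: a distribution over the digit types 0-9.
q = np.array([0.02, 0.01, 0.05, 0.6, 0.02, 0.1, 0.05, 0.05, 0.05, 0.05])

# Hard decision: return the estimated most likely digit type.
hard_decision = np.argmax(q)
print(hard_decision)  # 3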

How to ensure a(L)\M{a}^{(L)} is a valid probability vector?

The soft-max activation function is often used for the last layer:

$$\begin{align} \sigma^{(L)}\left(\left[\begin{smallmatrix}z^{(\ell)}_1 \\ \vdots \\ z^{(\ell)}_k\end{smallmatrix}\right]\right) := \frac{1}{\sum_{i=1}^k e^{z^{(\ell)}_i}} \left[\begin{smallmatrix}e^{z^{(\ell)}_1} \\ \vdots \\ e^{z^{(\ell)}_k}\end{smallmatrix}\right]\tag{soft-max} \end{align}$$

where $k:=\abs{\mc{Y}}=10$ is the number of distinct class labels.

It follows that:

$$\sum_{i=1}^k a_i^{(L)} = 1\kern1em \text{and} \kern1em a_i^{(L)}\geq 0\qquad \forall i\in \{1,\dots,k\}.$$
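
As a sanity check (a minimal numpy sketch, not part of the original notebook), the soft-max can be computed directly; subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

import numpy as np


def softmax(z):
    # Subtract the maximum for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()


z = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0, 0.0, -0.5, 2.5, 1.5])  # k = 10 scores
a_L = softmax(z)
print(a_L.sum())         # 1.0: the entries sum to one
print((a_L >= 0).all())  # True: the entries are non-negative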

How are the different layers related?

$$\begin{align} \M{a}^{(\ell)}&:=\begin{cases} \M{x} & \ell=0\\ \sigma^{(\ell)}(\overbrace{\M{W}^{(\ell)}\M{a}^{(\ell-1)}+\M{b}^{(\ell)}}^{\RM{z}^{(\ell)}:=})& \ell>0; \end{cases}\tag{net} \end{align}$$

where

  • $\M{W}^{(\ell)}$ is a matrix of weights;
  • $\M{b}^{(\ell)}$ is a vector called the bias; and
  • $\sigma^{(\ell)}$ is a real-valued function called the activation function (see the forward-pass sketch after this list).
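
To make (net) concrete, here is a minimal numpy sketch of the forward computation for a single layer; the weights, bias, and input values below are made up purely for illustration, with ReLU as the activation:

import numpy as np

a_prev = np.array([0.5, -1.0, 2.0])   # previous-layer output a^(l-1) with 3 units

W = np.array([[0.1, -0.2, 0.3],
              [0.4, 0.0, -0.1]])      # weight matrix W^(l) for a layer with 2 units
b = np.array([0.01, -0.02])           # bias vector b^(l)

z = W @ a_prev + b                    # pre-activation z^(l) := W^(l) a^(l-1) + b^(l)
a = np.maximum(0, z)                  # a^(l) := sigma^(l)(z^(l)) with ReLU activation
print(z, a)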

The activation function $\sigma^{(\ell)}$ for the other layers $1\leq \ell<L$ is often the vectorized version of one of the following:

  • sigmoid:
    $$\sigma_{\text{sigmoid}}(z) = \frac{1}{1+e^{-z}}$$
  • rectified linear unit (ReLU):
    $$\sigma_{\text{ReLU}}(z) = \max\{0,z\}.$$

The following plots the ReLU activation function.

def ReLU(z):
    return np.max([np.zeros(z.shape), z], axis=0)


z = np.linspace(-5, 5, 100)
plt.figure(num=1)
plt.plot(z, ReLU(z))
plt.xlim(-5, 5)
plt.title(r"ReLU: $\max\{0,z\}$")
plt.xlabel(r"$z$")
plt.show()
def sigmoid(z):
    ### BEGIN SOLUTION
    return 1 / (1 + np.exp(-z))
    ### END SOLUTION


z = np.linspace(-5, 5, 100)
plt.figure(num=2)
plt.plot(z, sigmoid(z))
plt.xlim(-5, 5)
plt.title(r"Sigmoid function: $\frac{1}{1+e^{-z}}$")
plt.xlabel(r"$z$")
plt.show()
# tests
### BEGIN HIDDEN TESTS
z_test = np.linspace(-5, 5, 10)
assert np.isclose(sigmoid(z_test), (lambda z: 1 / (1 + np.exp(-z)))(z_test)).all()
### END HIDDEN TESTS

Implementation

The following uses the keras library to define the basic neural network architecture.

keras runs on top of tensorflow and offers a higher-level abstraction to simplify the construction and training of a neural network. (tflearn is another library that provides a higher-level API for tensorflow.)

def create_simple_model():
    tf.keras.backend.clear_session()  # clear keras cache.
    # See https://github.com/keras-team/keras/issues/7294
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(16, activation=tf.keras.activations.relu),
            tf.keras.layers.Dense(16, activation=tf.keras.activations.relu),
            tf.keras.layers.Dense(10, activation=tf.keras.activations.softmax),
        ],
        "Simple_sequential",
    )
    return model


model = create_simple_model()
model.summary()

The above defines a linear stack of fully-connected layers after flattening the input. The method summary is useful for debugging in Keras.
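
As a quick check of the parameter counts reported by summary (assuming the 784-16-16-10 stack defined in create_simple_model above), each dense layer has (number of inputs × number of units) weights plus one bias per unit:

# Dense layer parameters = weights + biases for the 784 -> 16 -> 16 -> 10 stack above.
params_dense1 = 28 * 28 * 16 + 16  # 12560
params_dense2 = 16 * 16 + 16       # 272
params_dense3 = 16 * 10 + 10       # 170
print(params_dense1 + params_dense2 + params_dense3)  # 13002 trainable parameters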

### BEGIN SOLUTION
n_hidden_layers = len(model.layers) - 2
### END SOLUTION
n_hidden_layers
# tests
### BEGIN HIDDEN TESTS
assert n_hidden_layers == len(model.layers) - 2
### END HIDDEN TESTS