
Information Quantities for Probabilistic Classifiers

City University of Hong Kong

This notebook will introduce the information quantities often used for training probabilistic classifiers.

As an example, the following handwritten digit classifier is trained by deep learning using cross entropy loss:

  1. Handwrite a digit from 0, ..., 9.
  2. Click predict to see if the app can recognize the digit.

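The interactive demo itself is not reproduced in this notebook. For a rough idea of how such a classifier can be trained, the sketch below fits a small network to MNIST with the cross entropy loss; PyTorch, torchvision, the architecture, and the hyperparameters are all assumptions made for illustration and need not match the demo app.

```python
# A minimal training sketch using the cross entropy loss. The dataset (MNIST),
# network architecture, and hyperparameters are illustrative assumptions, not
# necessarily those of the demo app. Requires torch and torchvision.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small fully connected network that outputs one logit per digit class.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                      nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # cross entropy between the estimate and the label

for images, labels in loader:    # one pass over the data for a quick demo
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```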

Information Divergence

A fundamental property of mutual information is its non-negativity:

$$
I(X;Y) \geq 0,
$$

with equality if and only if $X$ and $Y$ are independent.

To show this, we think of the mutual information as a statistical distance, namely the Kullback-Leibler (KL) divergence of the joint distribution from the product of the marginals:

$$
I(X;Y) = D(P_{X,Y} \| P_X \times P_Y), \qquad \text{where} \quad D(P\|Q) := E_{Z\sim P}\left[\log \frac{P(Z)}{Q(Z)}\right].
$$
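As a quick numerical sanity check of this identity, the sketch below computes the divergence of a made-up $2\times 2$ joint pmf from the product of its marginals (NumPy assumed available).

```python
# A numerical sanity check of I(X;Y) = D(P_{X,Y} || P_X x P_Y) for a made-up
# 2-by-2 joint pmf. Requires NumPy.
import numpy as np

P_XY = np.array([[0.3, 0.1],
                 [0.2, 0.4]])          # joint pmf of (X, Y)
P_X = P_XY.sum(axis=1, keepdims=True)  # marginal pmf of X (column vector)
P_Y = P_XY.sum(axis=0, keepdims=True)  # marginal pmf of Y (row vector)

# Divergence of the joint distribution from the product of the marginals, in bits.
I_XY = np.sum(P_XY * np.log2(P_XY / (P_X * P_Y)))
print(f"I(X;Y) = {I_XY:.4f} bits")     # strictly positive here: X and Y are dependent
```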

For the information divergence to be called a divergence, it has to satisfy the following property:

YOUR ANSWER HERE

Cross Entropy

A probabilistic classifier returns a conditional probability estimate $\hat{P}_{Y|X}$ as a function of the training data $S$, which consists of i.i.d. samples of $(X,Y)$ that are independent of $(X,Y)$ itself.

A sensible choice of the loss function is

$$
\ell(\hat{P}_{Y|X}(\cdot|x), P_{Y|X}(\cdot|x)) := D(\hat{P}_{Y|X}(\cdot|x)\,\|\,P_{Y|X}(\cdot|x))
$$

because, by the positivity of divergence (Lemma 1), the above loss is non-negative and equals 0 iff $P_X\times \hat{P}_{Y|X}=P_X \times P_{Y|X}$ almost surely. Using this loss function, we have a simple bias-variance trade-off, illustrated numerically after the list below:

  • The variance $I(S;\hat{Y}|X)$ (also $I(S;X,\hat{Y})$ as $I(S;X)=0$) reflects the level of overfitting, as it measures how much the estimate depends on the training data.
  • The bias $D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)$ reflects the level of underfitting, as it measures how much the expected estimate deviates from the ground truth $P_{Y|X}$.
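The following sketch illustrates this trade-off numerically in a deliberately simplified setting without the input $X$: $Y$ is Bernoulli, the training set consists of a few i.i.d. samples of $Y$, and the estimate is a Laplace-smoothed empirical pmf. These modelling choices are assumptions made only for this sketch; it checks that the expected divergence loss splits into the bias term plus a variance term equal to $I(S;\hat{Y})$ (NumPy and SciPy assumed available).

```python
# An illustrative check of the bias-variance trade-off in a deliberately
# simplified setting with no input X: Y ~ Bernoulli(p), the training set S
# consists of n i.i.d. samples of Y, and the estimate P_hat is the
# Laplace-smoothed empirical pmf. The choices of p, n, and the estimator are
# assumptions made only for this sketch. Requires NumPy and SciPy.
import numpy as np
from scipy.stats import binom

p, n = 0.3, 5
P_Y = np.array([1 - p, p])                   # true pmf of Y

# Enumerate the training sets by the count k of ones among the n samples.
ks = np.arange(n + 1)
weights = binom.pmf(ks, n, p)                # probability that S contains k ones
P_hat = np.stack([(n - ks + 1) / (n + 2),    # estimate of P(Y=0) given S
                  (ks + 1) / (n + 2)],       # estimate of P(Y=1) given S
                 axis=1)

def D(Q, P):
    """Divergence D(Q || P) between pmfs on {0, 1}, in bits."""
    return np.sum(Q * np.log2(Q / P), axis=-1)

expected_loss = np.sum(weights * D(P_hat, P_Y))   # E_S[ D(P_hat || P_Y) ]
P_bar = weights @ P_hat                           # E_S[ P_hat ], the pmf of Y_hat
bias = D(P_bar, P_Y)                              # underfitting term
variance = np.sum(weights * D(P_hat, P_bar))      # I(S; Y_hat), overfitting term

print(f"expected loss  : {expected_loss:.4f} bits")
print(f"bias + variance: {bias + variance:.4f} bits")  # matches the expected loss
```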

The loss in (4), however, cannot be evaluated on $S$ for training because $P_{Y|X}(\cdot|x_i)$ is not known. Instead, we often use the cross entropy loss

$$
\ell(\hat{P}_{Y|X}(\cdot|x), y) := \log \frac{1}{\hat{p}_{Y|X}(y|x)}.
$$
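For reference, the small sketch below (PyTorch assumed available) relates this formula to the `CrossEntropyLoss` used in the earlier training sketch: the softmax of the model's logits serves as $\hat{p}_{Y|X}(\cdot|x)$, and the built-in loss averages $\log\frac{1}{\hat{p}_{Y|X}(y_i|x_i)}$ over the samples. The logits and labels are made up for illustration.

```python
# A minimal sketch relating the formula above to torch.nn.functional.cross_entropy,
# the loss used in the earlier training sketch: a softmax of the model's logits
# plays the role of p_hat_{Y|X}(.|x), and the built-in loss averages
# log(1 / p_hat_{Y|X}(y_i | x_i)) over the samples. The logits and labels below
# are made up for illustration. Requires PyTorch.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])   # model outputs for 2 samples, 3 classes
labels = torch.tensor([0, 2])              # observed labels y_i

p_hat = F.softmax(logits, dim=1)           # conditional probability estimates
manual = torch.log(1 / p_hat[torch.arange(2), labels]).mean()
builtin = F.cross_entropy(logits, labels)  # same quantity computed directly
print(manual.item(), builtin.item())       # the two values agree up to rounding
```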

YOUR ANSWER HERE