Information Quantities for Probabilistic Classifiers

Chung Chan
City University of Hong Kong
This notebook will introduce the information quantities often used for training probabilistic classifiers.
As an example, the following handwritten digit classifier is trained by deep learning using cross entropy loss:
Handwrite a digit from 0, ..., 9, and click predict to see if the app can recognize the digit.
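For reference, below is a minimal sketch of how a digit classifier like the one above could be trained with the cross entropy loss (defined later in this notebook). It assumes PyTorch and torchvision are available; the network, optimizer, and hyperparameters are illustrative choices, not necessarily those used by the app.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST digits as (image, label) pairs, i.e., samples of (X, Y).
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small network parameterizing the estimate of P_{Y|X}; the softmax that turns
# the logits into probabilities is applied inside nn.CrossEntropyLoss.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                      nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # empirical cross entropy loss

for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        # batch average of log(1 / p_hat_{Y|X}(Y|X))
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```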
A fundamental property of mutual information is that it is non-negative:
\begin{align}
I(X;Y) \geq 0,
\end{align}
with equality iff $X$ and $Y$ are independent, i.e., $P_{X,Y}=P_X\times P_Y$ almost everywhere. To show this, we think of the mutual information as a statistical distance called the Kullback-Leibler (KL) divergence:
\begin{align}
I(X;Y) = D(P_{X,Y}\|P_X\times P_Y).
\end{align}
For the information divergence to be called a divergence, it has to satisfy the following property:

Lemma 1 (Positivity of divergence)
For any random variables $Z$ and $Z'$ taking values from the same alphabet,
\begin{align}
D(P_{Z}\|P_{Z'}) \geq 0,
\end{align}
with equality iff $P_{Z}=P_{Z'}$ almost everywhere.
Without loss of generality, assume $P_Z$ is absolutely continuous with respect to $P_{Z'}$, i.e., $p_{Z'}(z)>0$ whenever $p_Z(z)>0$ (otherwise the divergence is infinite and the inequality holds trivially). We can then rewrite the divergence as the expectation:
\begin{align}
D(P_{Z}\|P_{Z'}) &:= E\left[ \frac{p_{Z}(Z')}{p_{Z'}(Z')} \log\frac{p_{Z}(Z')}{p_{Z'}(Z')} \right]\\
&\geq E\left[ \frac{p_{Z}(Z')}{p_{Z'}(Z')}\right] \log \underbrace{E\left[\frac{p_{Z}(Z')}{p_{Z'}(Z')} \right]}_{=1} = 0,
\end{align}
where the last inequality follows from Jensen's inequality and the convexity of $r \mapsto r \log r$. Since $r \mapsto r \log r$ is strictly convex, equality holds iff $\frac{p_{Z}(Z')}{p_{Z'}(Z')}=1$ almost surely, i.e., $P_{Z}=P_{Z'}$ almost everywhere.
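As a quick numerical illustration (not part of the derivation above), the sketch below evaluates the divergence for arbitrary example distributions, checks that it vanishes when the two distributions coincide, and checks that the mutual information, being a divergence, is non-negative:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum_z p(z) log(p(z)/q(z)) for discrete pmfs (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(z) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two arbitrary distributions on the same alphabet.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(kl_divergence(p, q))  # > 0 since p != q
print(kl_divergence(p, p))  # = 0 since the distributions coincide

# Mutual information as a divergence: I(X;Y) = D(P_XY || P_X x P_Y) >= 0.
P_XY = np.array([[0.3, 0.1],
                 [0.2, 0.4]])            # joint pmf of (X, Y)
P_X = P_XY.sum(axis=1, keepdims=True)    # marginal of X
P_Y = P_XY.sum(axis=0, keepdims=True)    # marginal of Y
print(kl_divergence(P_XY.ravel(), (P_X * P_Y).ravel()))  # >= 0
```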
A probabilistic classifier returns a conditional probability estimate $\hat{P}_{Y|X}$ as a function of the training data $S$, which consists of i.i.d. samples of $(X,Y)$ but is independent of the pair $(X,Y)$ to be classified.
A sensible choice of the loss function is
\begin{align}
\ell(\hat{P}_{Y|X}(\cdot|x), P_{Y|X}(\cdot|x)) := D(\hat{P}_{Y|X}(\cdot|x)\|P_{Y|X}(\cdot|x))
\end{align}
because, by the positivity of divergence (Lemma 1), the above loss is non-negative and equal to 0 iff $P_X\times \hat{P}_{Y|X}=P_X \times P_{Y|X}$ almost surely. Using this loss function, we have a simple bias-variance trade-off:
Theorem 2 (Bias-variance trade-off)
The expected loss (risk) for the loss function (4) is
\begin{align}
E[D(\hat{P}_{Y|X} \| P_{Y|X}|P_X)] = \overbrace{I(S;\hat{Y}|X)}^{\text{Variance}} + \overbrace{D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)}^{\text{Bias}}
\end{align}
where $\hat{Y}$ is distributed according to
\begin{align}
P_{X,Y,S,\hat{Y}}&=P_{X,Y}\times P_{S} \times P_{\hat{Y}|X,S} && \text{where}\\
P_{\hat{Y}|X,S}(y|x,s) &= \hat{P}_{Y|X}(y|x) && \text{for }(x,y,s)\in \mathcal{X}\times \mathcal{Y}\times \mathcal{S}.
\end{align}
The variance $I(S;\hat{Y}|X)$ (also $I(S;X,\hat{Y})$ as $I(S;X)=0$) reflects the level of overfitting, as it measures how much the estimate depends on the training data. The bias $D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)$ reflects the level of underfitting, as it measures how much the expected estimate $E[\hat{P}_{Y|X}]$ deviates from the ground truth $P_{Y|X}$.

To prove the theorem, decompose the risk as
\begin{align}
E[D(\hat{P}_{Y|X} \| P_{Y|X}|P_X)]
&= E\left[\log \frac{\hat{p}_{Y|X}(\hat{Y}|X)}{p_{Y|X}(\hat{Y}|X)}\right]\\
&= \underbrace{E\left[\log \frac{\hat{p}_{Y|X}(\hat{Y}|X)}{E[\hat{p}_{Y|X}](\hat{Y}|X)}\right]}_{\text{(i)}} + \underbrace{E\left[\log \frac{E[\hat{p}_{Y|X}](\hat{Y}|X)}{p_{Y|X}(\hat{Y}|X)}\right]}_{=D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)\text{ (bias)}}
\end{align}
It remains to show that (i) is the variance. By (6),
\begin{align}
E[\hat{p}_{Y|X}](y|x)
&= E[p_{\hat{Y}|X,S}(y|x,S)|X=x]\\
&= p_{\hat{Y}|X}(y|x).
\end{align}
Substituting (6) and the above into (i), we have
\begin{align}
\text{(i)} &= E\left[\log \frac{p_{\hat{Y}|X,S}(\hat{Y}|X,S)}{p_{\hat{Y}|X}(\hat{Y}|X)}\right]\\
&= I(S;\hat{Y}|X),
\end{align}
which completes the proof.
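The identity in Theorem 2 can also be verified numerically. Below is a minimal sketch (not part of the original notebook) in which the training outcome $S$ takes finitely many values and the corresponding estimates are randomly generated stand-ins for classifiers trained on different data sets; all distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground truth: |X| = 2, |Y| = 3.
P_X = np.array([0.4, 0.6])
P_Y_given_X = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

# Hypothetical ensemble of estimators: P_hat[s] is the estimate produced from
# training outcome s, which occurs with probability P_S[s].
n_outcomes = 5
P_S = np.full(n_outcomes, 1 / n_outcomes)
P_hat = rng.dirichlet(np.ones(3), size=(n_outcomes, 2))  # shape (s, x, y)

def cond_kl(P, Q, P_X):
    """Conditional divergence D(P_{Y|X} || Q_{Y|X} | P_X) for pmfs of shape (x, y)."""
    return float(np.sum(P_X[:, None] * P * np.log(P / Q)))

# Risk: expectation of the divergence loss over the training outcome S.
risk = sum(P_S[s] * cond_kl(P_hat[s], P_Y_given_X, P_X) for s in range(n_outcomes))

# Mean estimate E[P^_{Y|X}], i.e., the marginal distribution of Y^ given X.
P_bar = np.tensordot(P_S, P_hat, axes=1)  # shape (x, y)

# Variance I(S; Y^ | X) and bias D(E[P^_{Y|X}] || P_{Y|X} | P_X).
variance = sum(P_S[s] * cond_kl(P_hat[s], P_bar, P_X) for s in range(n_outcomes))
bias = cond_kl(P_bar, P_Y_given_X, P_X)

print(risk, variance + bias)  # the two values agree up to floating-point error
```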
The loss in (4), however, cannot be evaluated on $S$ for training because $P_{Y|X}(\cdot|x_i)$ is not known. Instead, we often use the cross entropy loss
\begin{align}
\ell(\hat{P}_{Y|X}(\cdot |x),y) := \log \frac{1}{\hat{p}_{Y|X}(y|x)}.
\end{align}

Theorem 3 (Cross entropy)
The risk for the loss in (10) is
\begin{align}
E\left[\log \frac1{\hat{p}_{Y|X}(Y|X)}\right]
&= H(Y|X) + E[D(P_{Y|X}\| \hat{P}_{Y|X}|P_X)] \\
&\geq H(Y|X)
\end{align}
with equality iff $P_{X}\times P_{Y|X}=P_{X}\times \hat{P}_{Y|X}$ almost everywhere.
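As with Theorem 2, this decomposition can be checked numerically. The sketch below uses arbitrary illustrative distributions for $P_X$, $P_{Y|X}$, and the estimate $\hat{P}_{Y|X}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy ground truth and an arbitrary estimate P^_{Y|X} (illustrative values only).
P_X = np.array([0.4, 0.6])
P_Y_given_X = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
P_hat = rng.dirichlet(np.ones(3), size=2)  # shape (x, y)

# Risk of the cross entropy loss: E[log 1/p^_{Y|X}(Y|X)].
cross_entropy = -np.sum(P_X[:, None] * P_Y_given_X * np.log(P_hat))

# Conditional entropy H(Y|X) and conditional divergence D(P_{Y|X} || P^_{Y|X} | P_X).
H_Y_given_X = -np.sum(P_X[:, None] * P_Y_given_X * np.log(P_Y_given_X))
divergence = np.sum(P_X[:, None] * P_Y_given_X * np.log(P_Y_given_X / P_hat))

print(cross_entropy, H_Y_given_X + divergence)  # the two values agree
print(cross_entropy >= H_Y_given_X)             # True, as stated in Theorem 3
```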