Information Quantities for Probabilistic Classifiers

Chung Chan
City University of Hong Kong
This notebook will introduce the information quantities often used for training probabilistic classifiers.
As an example, the following handwritten digit classifier is trained by deep learning using cross entropy loss:
Handwrite a digit from 0, ..., 9, and click predict to see if the app can recognize the digit.
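For reference, below is a minimal sketch of how a digit classifier like the one above could be trained with the cross entropy loss (defined later in this notebook). It assumes PyTorch and torchvision are available; the network, optimizer, and hyperparameters are illustrative choices, not necessarily those used by the app.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST digits as (image, label) pairs, i.e., samples of (X, Y).
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small network parameterizing the estimate of P_{Y|X}; the softmax that turns
# the logits into probabilities is applied inside nn.CrossEntropyLoss.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                      nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # empirical cross entropy loss

for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        # batch average of log(1 / p_hat_{Y|X}(Y|X))
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```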
A fundamental property of mutual information is that it is non-negative:
\begin{align}
I(X;Y) \geq 0,
\end{align}
with equality iff $X$ and $Y$ are independent, i.e., $P_{X,Y}=P_X\times P_Y$ almost everywhere. To show this, we think of the mutual information as a statistical distance called the Kullback-Leibler (KL) divergence:
\begin{align}
I(X;Y) = D(P_{X,Y}\|P_X\times P_Y).
\end{align}
For the information divergence to be called a divergence, it has to satisfy the following property:

Lemma 1 (Positivity of divergence)
For any random variables $Z$ and $Z'$ taking values from the same alphabet,
\begin{align}
D(P_{Z}\|P_{Z'}) \geq 0,
\end{align}
with equality iff $P_{Z}=P_{Z'}$ almost everywhere.
Without loss of generality, assume $P_Z$ is absolutely continuous with respect to $P_{Z'}$, i.e., $p_{Z'}(z)>0$ whenever $p_Z(z)>0$ (otherwise the divergence is infinite and the inequality holds trivially). We can then rewrite the divergence as the expectation:
\begin{align}
D(P_{Z}\|P_{Z'}) &:= E\left[ \frac{p_{Z}(Z')}{p_{Z'}(Z')} \log\frac{p_{Z}(Z')}{p_{Z'}(Z')} \right]\\
&\geq E\left[ \frac{p_{Z}(Z')}{p_{Z'}(Z')}\right] \log \underbrace{E\left[\frac{p_{Z}(Z')}{p_{Z'}(Z')} \right]}_{=1} = 0,
\end{align}
where the last inequality follows from Jensen's inequality and the convexity of $r \mapsto r \log r$. Since $r \mapsto r \log r$ is strictly convex, equality holds iff $\frac{p_{Z}(Z')}{p_{Z'}(Z')}=1$ almost surely, i.e., $P_{Z}=P_{Z'}$ almost everywhere.
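As a quick numerical illustration (not part of the derivation above), the sketch below evaluates the divergence for arbitrary example distributions, checks that it vanishes when the two distributions coincide, and checks that the mutual information, being a divergence, is non-negative:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum_z p(z) log(p(z)/q(z)) for discrete pmfs (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(z) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two arbitrary distributions on the same alphabet.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(kl_divergence(p, q))  # > 0 since p != q
print(kl_divergence(p, p))  # = 0 since the distributions coincide

# Mutual information as a divergence: I(X;Y) = D(P_XY || P_X x P_Y) >= 0.
P_XY = np.array([[0.3, 0.1],
                 [0.2, 0.4]])            # joint pmf of (X, Y)
P_X = P_XY.sum(axis=1, keepdims=True)    # marginal of X
P_Y = P_XY.sum(axis=0, keepdims=True)    # marginal of Y
print(kl_divergence(P_XY.ravel(), (P_X * P_Y).ravel()))  # >= 0
```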
A probabilistic classifier returns a conditional probability estimate $\hat{P}_{Y|X}$ as a function of the training data $S$, which consists of i.i.d. samples of $(X,Y)$ but is independent of the pair $(X,Y)$ to be classified.
A sensible choice of the loss function is
\begin{align}
\ell(\hat{P}_{Y|X}(\cdot|x), P_{Y|X}(\cdot|x)) := D(\hat{P}_{Y|X}(\cdot|x)\|P_{Y|X}(\cdot|x))
\end{align}
because, by the positivity of divergence (Lemma 1), the above loss is non-negative and equal to 0 iff $P_X\times \hat{P}_{Y|X}=P_X \times P_{Y|X}$ almost surely. Using this loss function, we have a simple bias-variance trade-off:
Theorem 2 (Bias-variance trade-off)
The expected loss (risk) for the loss function (4) is
\begin{align}
E[D(\hat{P}_{Y|X} \| P_{Y|X}|P_X)] = \overbrace{I(S;\hat{Y}|X)}^{\text{Variance}} + \overbrace{D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)}^{\text{Bias}}
\end{align}
where $\hat{Y}$ is distributed according to
\begin{align}
P_{X,Y,S,\hat{Y}}&=P_{X,Y}\times P_{S} \times P_{\hat{Y}|X,S} && \text{where}\\
P_{\hat{Y}|X,S}(y|x,s) &= \hat{P}_{Y|X}(y|x) && \text{for }(x,y,s)\in \mathcal{X}\times \mathcal{Y}\times \mathcal{S}.
\end{align}
The variance $I(S;\hat{Y}|X)$ (also $I(S;X,\hat{Y})$ as $I(S;X)=0$) reflects the level of overfitting, as it measures how much the estimate depends on the training data. The bias $D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)$ reflects the level of underfitting, as it measures how much the expected estimate $E[\hat{P}_{Y|X}]$ deviates from the ground truth $P_{Y|X}$.

To prove the theorem, decompose the risk as
\begin{align}
E[D(\hat{P}_{Y|X} \| P_{Y|X}|P_X)]
&= E\left[\log \frac{\hat{p}_{Y|X}(\hat{Y}|X)}{p_{Y|X}(\hat{Y}|X)}\right]\\
&= \underbrace{E\left[\log \frac{\hat{p}_{Y|X}(\hat{Y}|X)}{E[\hat{p}_{Y|X}](\hat{Y}|X)}\right]}_{\text{(i)}} + \underbrace{E\left[\log \frac{E[\hat{p}_{Y|X}](\hat{Y}|X)}{p_{Y|X}(\hat{Y}|X)}\right]}_{=D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)\text{ (bias)}}
\end{align}
It remains to show that (i) is the variance. By (6),
\begin{align}
E[\hat{p}_{Y|X}](y|x)
&= E[p_{\hat{Y}|X,S}(y|x,S)|X=x]\\
&= p_{\hat{Y}|X}(y|x).
\end{align}
Substituting (6) and the above into (i), we have
\begin{align}
\text{(i)} &= E\left[\log \frac{p_{\hat{Y}|X,S}(\hat{Y}|X,S)}{p_{\hat{Y}|X}(\hat{Y}|X)}\right]\\
&= I(S;\hat{Y}|X),
\end{align}
which completes the proof.
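The identity in Theorem 2 can also be verified numerically. Below is a minimal sketch (not part of the original notebook) in which the training outcome $S$ takes finitely many values and the corresponding estimates are randomly generated stand-ins for classifiers trained on different data sets; all distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground truth: |X| = 2, |Y| = 3.
P_X = np.array([0.4, 0.6])
P_Y_given_X = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

# Hypothetical ensemble of estimators: P_hat[s] is the estimate produced from
# training outcome s, which occurs with probability P_S[s].
n_outcomes = 5
P_S = np.full(n_outcomes, 1 / n_outcomes)
P_hat = rng.dirichlet(np.ones(3), size=(n_outcomes, 2))  # shape (s, x, y)

def cond_kl(P, Q, P_X):
    """Conditional divergence D(P_{Y|X} || Q_{Y|X} | P_X) for pmfs of shape (x, y)."""
    return float(np.sum(P_X[:, None] * P * np.log(P / Q)))

# Risk: expectation of the divergence loss over the training outcome S.
risk = sum(P_S[s] * cond_kl(P_hat[s], P_Y_given_X, P_X) for s in range(n_outcomes))

# Mean estimate E[P^_{Y|X}], i.e., the marginal distribution of Y^ given X.
P_bar = np.tensordot(P_S, P_hat, axes=1)  # shape (x, y)

# Variance I(S; Y^ | X) and bias D(E[P^_{Y|X}] || P_{Y|X} | P_X).
variance = sum(P_S[s] * cond_kl(P_hat[s], P_bar, P_X) for s in range(n_outcomes))
bias = cond_kl(P_bar, P_Y_given_X, P_X)

print(risk, variance + bias)  # the two values agree up to floating-point error
```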
The loss in (4), however, cannot be evaluated on $S$ for training because $P_{Y|X}(\cdot|x_i)$ is not known. Instead, we often use the cross entropy loss
\begin{align}
\ell(\hat{P}_{Y|X}(\cdot |x),y) := \log \frac{1}{\hat{p}_{Y|X}(y|x)}.
\end{align}

Theorem 3 (Cross entropy)
The risk for the loss in (10) is
\begin{align}
E\left[\log \frac1{\hat{p}_{Y|X}(Y|X)}\right]
&= H(Y|X) + E[D(P_{Y|X}\| \hat{P}_{Y|X}|P_X)] \\
&\geq H(Y|X)
\end{align}
with equality iff $P_{X}\times P_{Y|X}=P_{X}\times \hat{P}_{Y|X}$ almost everywhere.
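As with Theorem 2, this decomposition can be checked numerically. The sketch below uses arbitrary illustrative distributions for $P_X$, $P_{Y|X}$, and the estimate $\hat{P}_{Y|X}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy ground truth and an arbitrary estimate P^_{Y|X} (illustrative values only).
P_X = np.array([0.4, 0.6])
P_Y_given_X = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
P_hat = rng.dirichlet(np.ones(3), size=2)  # shape (x, y)

# Risk of the cross entropy loss: E[log 1/p^_{Y|X}(Y|X)].
cross_entropy = -np.sum(P_X[:, None] * P_Y_given_X * np.log(P_hat))

# Conditional entropy H(Y|X) and conditional divergence D(P_{Y|X} || P^_{Y|X} | P_X).
H_Y_given_X = -np.sum(P_X[:, None] * P_Y_given_X * np.log(P_Y_given_X))
divergence = np.sum(P_X[:, None] * P_Y_given_X * np.log(P_Y_given_X / P_hat))

print(cross_entropy, H_Y_given_X + divergence)  # the two values agree
print(cross_entropy >= H_Y_given_X)             # True, as stated in Theorem 3
```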