Information Quantities for Decision Tree Induction

Chung Chan
City University of Hong Kong
import dit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dit.shannon import conditional_entropy, entropy, mutual_information
from IPython.display import Math, display
%matplotlib widget
In this notebook, we will use the dit package to compute the fundamental information quantities used in decision tree induction. More basic implementations and interpretations are given here.
Consider any $(X, Y)$ taking values from $\mathcal{X}\times \mathcal{Y}$ according to some joint probability measure $P_{X,Y}$.
Define
$$
\begin{align}
H(Y) &:= E\left[\log \tfrac1{p_{Y}(Y)}\right] && \text{called the entropy of $Y$,}\\
H(Y|X) &:= E\left[\log \tfrac1{p_{Y|X}(Y|X)}\right] && \text{called the conditional entropy of $Y$ given $X$,}
\end{align}
$$

where $p_{Y|X}$ and $p_Y$ are the probability density functions of $Y$ with and without conditioning on $X$, respectively. The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as the entropy of the random tuple $(X,Y)$ using the joint density $p_{X,Y}$.
Unless otherwise specified, all logarithms are base 2, in which case the information quantities are in units of bits (binary digits). A popular alternative is the natural logarithm, $\log = \ln$, in which case the unit is the nat. To make sense of the expectations in the definitions above, it is important to see that $p_Y(Y)$ and $p_{Y|X}(Y|X)$ are random because their arguments, namely $X$ and $Y$, are random, even though $p_Y$ and $p_{Y|X}$ are deterministic functions. For a discrete random variable $Y$, $p_Y$ and $p_{Y|X}$ should be the probability mass functions:
$$
\begin{align}
H(Y) &= \sum_{y\in \mathcal{Y}:p_Y(y)>0} p_Y(y) \log \frac1{p_Y(y)}\\
H(Y|X) &= E\left[\sum_{y\in \mathcal{Y}:p_{Y|X}(y|X)>0} p_{Y|X}(y|X) \log \frac1{p_{Y|X}(y|X)}\right]
\end{align}
$$

where $\mathcal{Y}$ is the alphabet of $Y$. The last expectation is over the randomness of $X$, which need not be discrete.
It is convenient to express the entropy as a function of the probability masses:

$$
h(p) = h(p_1, p_2, \dots) := \sum_{k:p_k>0} p_k \log \frac1{p_k}
$$

where $p_k\in [0,1]$ for all $k$, $\sum_k p_k=1$, and $p:i\mapsto p_i$.
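As a quick sanity check, here is a minimal NumPy implementation of $h$ (an illustration only; it is not part of the dit API):
import numpy as np


def h(p):
    """Entropy in bits of a probability vector, using the convention 0 log(1/0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero masses
    return float(np.sum(p * np.log2(1 / p)))


h([1 / 2, 0, 1 / 4, 1 / 4])  # 1.5 bits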
Rewriting the entropies in (2) in terms of $h$, we have

$$
\begin{align}
H(Y) &= h(p_Y)\\
H(Y|X) &= E[h(p_{Y|X}(\cdot|X))].
\end{align}
$$

For discrete random variables $Y$, explain whether $H(Y)$ is non-negative and scale-invariant in the sense that

$$
0\leq H(Y) = H(c Y) \quad \forall c\neq 0,
$$

using the above formulae in (4).
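As a numerical illustration only (the question above asks for an explanation), note that $cY$ has the same probability masses as $Y$, just attached to relabelled outcomes, so $h$ returns the same value:
import numpy as np


def h(p):
    """Entropy (bits) of a list of probability masses, ignoring zeros."""
    return sum(pk * np.log2(1 / pk) for pk in p if pk > 0)


pmf_Y = [3 / 4, 1 / 4]   # e.g. P(Y=1)=3/4, P(Y=2)=1/4
pmf_2Y = [3 / 4, 1 / 4]  # P(2Y=2)=3/4, P(2Y=4)=1/4: same masses, relabelled outcomes
h(pmf_Y), h(pmf_2Y)      # both ≈ 0.811 bits and non-negative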
To define and plot the distribution $(p_{00}, p_{01}, p_{10}, p_{11}) = \left(\frac12, 0, \frac14, \frac14\right)$ used in the examples below:
p = dit.Distribution(["00", "01", "10", "11"], [1 / 2, 0, 1 / 4, 1 / 4])
plt.stem(p.outcomes, p.pmf)
plt.xlabel("k")
plt.ylabel(r"$p_k$")
plt.ylim((0, 1))
plt.show()
To compute the entropy of the distribution:
Math("h(p_{00}, p_{01}, p_{10}, p_{11})=" + f"{entropy(p):.3g}")
The mutual information between $X$ and $Y$ is defined as

$$
\begin{align}
I(X;Y)
&:= E\left[\log \tfrac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right]\\
&= E\left[\log \tfrac{p_{Y|X}(Y|X)}{p_Y(Y)}\right]\\
&= E\left[\log \tfrac{p_{X|Y}(X|Y)}{p_X(X)}\right],
\end{align}
$$

where $p_{X,Y}$ and $p_X p_Y$ are the joint density and the product of the marginal density functions of $X$ and $Y$, respectively. The conditional mutual information of $X$ and $Y$ given $Z$ is
$$
\begin{align}
I(X;Y|Z)
&:= E\left[\log \tfrac{p_{X,Y|Z}(X,Y|Z)}{p_{X|Z}(X|Z)p_{Y|Z}(Y|Z)}\right],
\end{align}
$$

which has the additional conditioning on $Z$.
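To make these definitions concrete, the following sketch evaluates the unconditional mutual information (8) directly as a sum over a joint pmf; the numbers use the example distribution that appears later in this notebook:
import numpy as np

# Joint pmf of the example below: p_{XY}(0,0)=1/2, p_{XY}(1,0)=p_{XY}(1,1)=1/4.
p_xy = np.array([[1 / 2, 0], [1 / 4, 1 / 4]])
p_x = p_xy.sum(axis=1)  # marginal pmf of X
p_y = p_xy.sum(axis=0)  # marginal pmf of Y

# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), skipping zero masses
I_XY = sum(
    p_xy[x, y] * np.log2(p_xy[x, y] / (p_x[x] * p_y[y]))
    for x in range(2)
    for y in range(2)
    if p_xy[x, y] > 0
)
I_XY  # ≈ 0.311 bits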
All the information quantities defined so far can be concisely related using the Venn diagram shown in Figure 1.
Figure 1: Venn diagram interpretation of information measures
The mutual information (8) can be expressed in terms of the entropies as

$$
\begin{align}
I(X;Y)&=H(Y)-H(Y|X)\\
&=H(X)+H(Y)-H(X,Y)\\
&=H(X)-H(X|Y).
\end{align}
$$

The entropy satisfies
$$
\begin{align}
H(X,Y)&=H(X)+H(Y|X)\\
&=H(Y)+H(X|Y),
\end{align}
$$

which is called the chain rule of entropy.
The conditional mutual information (9) satisfies

$$
\begin{align}
I(X;Y|Z)&=I(X,Z;Y) - I(Y;Z) \\
&= I(X;Y,Z) - I(X;Z),
\end{align}
$$

which is called the chain rule of mutual information.
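The chain rule of mutual information can be checked numerically with dit. The sketch below uses a small three-variable distribution chosen purely for illustration (it is not used elsewhere in this notebook) and expresses $I(X;Y|Z)$ through joint entropies, so only the functions already imported above are needed:
import dit
from dit.shannon import entropy, mutual_information

# A small joint distribution over (X, Y, Z), chosen only for illustration:
# X and Y are independent fair bits and Z is their XOR.
d = dit.Distribution(["000", "011", "101", "110"], [1 / 4] * 4)
d.set_rv_names("XYZ")

# I(X;Y|Z) written in terms of joint entropies:
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_XY_given_Z = entropy(d, "XZ") + entropy(d, "YZ") - entropy(d, "XYZ") - entropy(d, "Z")

# Chain rule: I(X;Y|Z) = I(X,Z;Y) - I(Y;Z) = I(X;Y,Z) - I(X;Z)
rhs1 = mutual_information(d, ["X", "Z"], ["Y"]) - mutual_information(d, ["Y"], ["Z"])
rhs2 = mutual_information(d, ["X"], ["Y", "Z"]) - mutual_information(d, ["X"], ["Z"])

I_XY_given_Z, rhs1, rhs2  # all three equal 1 bit for this example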
Rewrite the distribution $p$ defined earlier in terms of the random variables $X$ and $Y$ as

$$
\begin{align}
p_{XY}(x,y)=\begin{cases}
\frac12 & (x,y)\in \{(0,0)\}\\
\frac14 & (x,y)\in \{(1,0), (1,1)\}\\
0 & \text{otherwise.}
\end{cases}
\end{align}
$$

To calculate the mutual information using the conditional entropy:
$$
\begin{align}
H(Y) &= h(p_Y(0), p_Y(1))\\
&= h(p_{00} + p_{10}, p_{01} + p_{11}) \\
&= h\left(\frac34, \frac14\right) \\
&\approx \underline{0.811}\\
H(Y|X) &= p_X(0)\, h(p_{Y|X}(0|0),p_{Y|X}(1|0)) + p_X(1)\, h(p_{Y|X}(0|1),p_{Y|X}(1|1))\\
&= (p_{00} + p_{01})\, h\left(\frac{p_{00}}{p_{00} + p_{01}}, \frac{p_{01}}{p_{00} + p_{01}} \right) + (p_{10} + p_{11})\, h\left(\frac{p_{10}}{p_{10} + p_{11}}, \frac{p_{11}}{p_{10} + p_{11}} \right)\\
&= \frac12 h\left(1, 0 \right) + \frac12 h\left(\frac12,\frac12 \right) \\
&= \underline{0.5} = 1.5-1 = H(X,Y) - H(X)\\
I(X;Y) &= H(Y)- H(Y|X) \\
&\approx 0.811 - 0.5\\
&= \underline{0.311}
\end{align}
$$

To verify the chain rule $H(X,Y)=H(X)+H(Y|X)$:
$$
\begin{align}
H(X, Y) - H(Y|X) &= h(p_{00}, p_{01}, p_{10}, p_{11}) - 0.5 \\
&= \underline{1} = h\left(\frac12,\frac12\right) = H(X).
\end{align}
$$

To define random variables in dit:
rv_names = "XY"
p.set_rv_names(rv_names)
p
We can now compute the entropies for different subsets of random variables:
for rvs in ["XY", "X", "Y", ""]:
    display(Math(f"H({','.join([rv for rv in rvs])})=" + f"{entropy(p, rvs):.3g}"))
To compute the conditional entropy:
Math("H(Y|X)=" + f"{conditional_entropy(p, 'Y', 'X'):.3g}")
To compute the mutual information:
Math("I(X;Y)=" + f"{mutual_information(p, 'X','Y'):.3g}")
# YOUR CODE HERE
raise NotImplementedError()
conditional_entropy_X_given_Y
# tests
assert (
    0
    <= conditional_entropy_X_given_Y
    <= min(entropy(p, "X"), entropy(p, "XY") - mutual_information(p, "X", "Y"))
)
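One possible way to complete the exercise above, assuming the goal is to compute $H(X|Y)$ (as suggested by the variable name and the test), is to mirror the earlier $H(Y|X)$ call:
# Possible solution sketch: H(X|Y) via dit, analogous to H(Y|X) above.
conditional_entropy_X_given_Y = conditional_entropy(p, "X", "Y")
Math("H(X|Y)=" + f"{conditional_entropy_X_given_Y:.3g}")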
Consider a dataset consisting of i.i.d. samples of some discrete random variables with an unknown joint distribution:
df = pd.read_csv("data.csv", dtype=str, skipinitialspace=True)
df
To estimate the information quantities of the features, we use the empirical distribution:
emp_p = dit.uniform([tuple(lst) for lst in df.to_numpy()])
emp_p.set_rv_names(df.columns)
emp_p
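For reference, the empirical distribution can also be constructed explicitly from the relative frequencies of the distinct rows (a sketch; it assumes a pandas version providing DataFrame.value_counts and should agree with emp_p above):
# Empirical pmf from relative frequencies of distinct rows (alternative construction).
counts = df.value_counts(normalize=True)  # Series indexed by tuples of row values
emp_p_alt = dit.Distribution(list(counts.index), list(counts.values))
emp_p_alt.set_rv_names(df.columns)
emp_p_alt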
How do we determine which attribute is more informative?
$\text{Info}(D)$ denotes the entropy $H(Y)$ of the empirical distribution of the target $Y$.
InfoD = entropy(emp_p, "Y")
Math(r"\text{Info}(D)=" + f"{InfoD:.3g}")
$\text{Info}_{X_i}(D)$ for each input feature $X_i$ denotes the conditional entropy $H(Y|X_i)$ of $Y$ given $X_i$.
InfoXD = {}
for cond in ["X1", "X2", "X3", "X4"]:
    InfoXD[cond] = conditional_entropy(emp_p, ["Y"], [cond])
Math(
r"""
\begin{{aligned}}
\text{{Info}}_{{X_1}}(D)&={X1:.3g}\\
\text{{Info}}_{{X_2}}(D)&={X2:.3g}\\
\text{{Info}}_{{X_3}}(D)&={X3:.3g}\\
\text{{Info}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
**InfoXD
)
)
GainXD = {}
# YOUR CODE HERE
raise NotImplementedError()
Math(
r"""
\begin{{aligned}}
\text{{Gain}}_{{X_1}}(D)&={X1:.3g}\\
\text{{Gain}}_{{X_2}}(D)&={X2:.3g}\\
\text{{Gain}}_{{X_3}}(D)&={X3:.3g}\\
\text{{Gain}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
**GainXD
)
)
# tests
assert np.isclose(GainXD["X1"], 0.5, rtol=1e-3)
assert np.isclose(GainXD["X2"], 1, rtol=1e-3)
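A sketch of one way to complete the cell above, assuming the usual definition of information gain, $\text{Gain}_{X_i}(D) := \text{Info}(D) - \text{Info}_{X_i}(D)$, which equals the mutual information $I(X_i;Y)$ of the empirical distribution:
# Possible solution sketch: information gain = Info(D) - Info_X(D) = I(X_i; Y).
GainXD = {cond: InfoD - InfoXD[cond] for cond in ["X1", "X2", "X3", "X4"]}
GainXD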
To avoid the bias towards attributes with many outcomes, C4.5/J48 normalizes the information gain by $\text{SplitInfo}_{X_i}(D)$, which can be calculated as $H(X_i)$:
SplitInfoXD = {}
for cond in ["X1", "X2", "X3", "X4"]:
    SplitInfoXD[cond] = entropy(emp_p, [cond])
Math(
r"""
\begin{{aligned}}
\text{{SplitInfo}}_{{X_1}}(D)&={X1:.3g}\\
\text{{SplitInfo}}_{{X_2}}(D)&={X2:.3g}\\
\text{{SplitInfo}}_{{X_3}}(D)&={X3:.3g}\\
\text{{SplitInfo}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
**SplitInfoXD
)
)
GainRatioXD = {}
# YOUR CODE HERE
raise NotImplementedError()
Math(
r"""
\begin{{aligned}}
\frac{{\text{{Gain}}_{{X_1}}(D)}}{{\text{{SplitInfo}}_{{X_1}}(D)}}&={X1:.3g}\\
\frac{{\text{{Gain}}_{{X_2}}(D)}}{{\text{{SplitInfo}}_{{X_2}}(D)}}&={X2:.3g}\\
\frac{{\text{{Gain}}_{{X_3}}(D)}}{{\text{{SplitInfo}}_{{X_3}}(D)}}&={X3:.3g}\\
\frac{{\text{{Gain}}_{{X_4}}(D)}}{{\text{{SplitInfo}}_{{X_4}}(D)}}&={X4:.3g}\\
\end{{aligned}}
""".format(
**GainRatioXD
)
)
# tests
assert np.isclose(GainRatioXD["X1"], 0.5, rtol=1e-3)
assert np.isclose(GainRatioXD["X2"], 1, rtol=1e-3)
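Similarly, a sketch for the gain ratio used by C4.5/J48, assuming $\text{GainRatio}_{X_i}(D) := \text{Gain}_{X_i}(D)/\text{SplitInfo}_{X_i}(D)$:
# Possible solution sketch: normalize the information gain by the split information.
GainRatioXD = {
    cond: GainXD[cond] / SplitInfoXD[cond] for cond in ["X1", "X2", "X3", "X4"]
}
GainRatioXD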