
Information Quantities for Decision Tree Induction

City University of Hong Kong
import dit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dit.shannon import conditional_entropy, entropy, mutual_information
from IPython.display import Math, display

%matplotlib widget

In this notebook, we will use the dit package to compute the fundamental information quantities used in decision tree induction. More basic implementations and interpretations are given here.

Entropy

Consider any $(X, Y)$ taking values from $\mathcal{X}\times \mathcal{Y}$ according to some joint probability measure $P_{X,Y}$.

To make sense of the expectations in the definition of entropy, it is important to see that $p_Y(Y)$ and $p_{Y|X}(Y|X)$ are random because

  • their arguments, namely $X$ and $Y$, are random,
  • even though $p_Y$ and $p_{Y|X}$ are deterministic functions.

For a discrete random variable $Y$, $p_Y$ and $p_{Y|X}$ in the definition of entropy are the probability mass functions:

\begin{align}
H(Y) &= \sum_{y\in \mathcal{Y}: p_Y(y)>0} p_Y(y) \log \frac1{p_Y(y)}\\
H(Y|X) &= E\left[\sum_{y\in \mathcal{Y}: p_{Y|X}(y|X)>0} p_{Y|X}(y|X) \log \frac1{p_{Y|X}(y|X)}\right]
\end{align}

where $\mathcal{Y}$ is the alphabet of $Y$. The last expectation is over the randomness of $X$, which need not be discrete.

It is convenient to express the entropy as a function of the probability masses as

h(p) = h(p_1, p_2, \dots) := \sum_{k: p_k>0} p_k \log \frac1{p_k}

where $p_k\in [0,1]$ for all $k$, $\sum_k p_k=1$, and $p: i\mapsto p_i$.

Rewriting the entropies above in terms of $h$, we have

\begin{align}
H(Y) &= h(p_Y)\\
H(Y|X) &= E[h(p_{Y|X}(\cdot|X))].
\end{align}
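For illustration, $h$ can be coded directly with NumPy. Below is a minimal sketch (the helper h is defined here only for illustration and is not part of dit; logs are taken base 2 so that entropies are in bits, consistent with the values dit reports below):

def h(*p):
    """Entropy (in bits) of the probability masses p_1, p_2, ..."""
    masses = np.asarray(p, dtype=float)
    masses = masses[masses > 0]  # a zero mass contributes nothing to the sum
    return float(np.sum(masses * np.log2(1 / masses)))


h(1 / 2, 1 / 2)  # 1 bit for a fair coin flip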

YOUR ANSWER HERE

To define and plot the distribution $(p_{00}, p_{01}, p_{10}, p_{11}) = \left(\frac12, 0, \frac14, \frac14\right)$:

p = dit.Distribution(["00", "01", "10", "11"], [1 / 2, 0, 1 / 4, 1 / 4])

plt.stem(p.outcomes, p.pmf)
plt.xlabel("k")
plt.ylabel(r"$p_k$")
plt.ylim((0, 1))
plt.show()

To compute the entropy of the distribution:

Math("h(p_{00}, p_{01}, p_{10}, p_{11})=" + f"{entropy(p):.3g}")

Mutual Information

All the information quantities defined so far can be concisely related using the Venn diagram shown in Figure 1.

Figure 1: Venn diagram interpretation of information measures
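In particular, the regions of the Venn diagram correspond to the standard identities

\begin{align}
H(X,Y) &= H(X) + H(Y|X) = H(Y) + H(X|Y)\\
I(X;Y) &= H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y).
\end{align}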

To name the random variables $X$ and $Y$ in dit:

rv_names = "XY"
p.set_rv_names(rv_names)
p

We can now compute the entropies for different subsets of random variables:

for rvs in ["XY", "X", "Y", ""]:
    display(Math(f"H({','.join([rv for rv in rvs])})=" + f"{entropy(p, rvs):.3g}"))

To compute the conditional entropy:

Math("H(Y|X)=" + f"{conditional_entropy(p, 'Y', 'X'):.3g}")

To compute the mutual information:

Math("I(X;Y)=" + f"{mutual_information(p, 'X','Y'):.3g}")
Compute the conditional entropy $H(X|Y)$ of $X$ given $Y$ and assign the value to conditional_entropy_X_given_Y:

# YOUR CODE HERE
raise NotImplementedError()
conditional_entropy_X_given_Y

# tests
assert (
    0
    <= conditional_entropy_X_given_Y
    <= min(entropy(p, "X"), entropy(p, "XY") - mutual_information(p, "X", "Y"))
)

Decision Tree Induction

Consider a dataset consisting of i.i.d. samples of some discrete random variables with an unknown joint distribution:

df = pd.read_csv("data.csv", dtype=str, skipinitialspace=True)
df

To estimate the information quantities of the features, we use the empirical distribution:

emp_p = dit.uniform([tuple(lst) for lst in df.to_numpy()])
emp_p.set_rv_names(df.columns)
emp_p
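To view the raw relative frequencies of the rows directly with pandas (a quick sketch, assuming a pandas version that provides DataFrame.value_counts):

# Relative frequency of each distinct row (X1, ..., X4, Y) in the dataset
df.value_counts(normalize=True)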

How do we determine which attribute is more informative?

$\text{Info}(D)$ denotes the entropy $H(Y)$ of the empirical distribution of the target $Y$.

InfoD = entropy(emp_p, "Y")
Math(r"\text{Info}(D)=" + f"{InfoD:.3g}")

$\text{Info}_{X_i}(D)$ for each input feature $X_i$ denotes the conditional entropy $H(Y|X_i)$ of $Y$ given $X_i$.

InfoXD = {}
for cond in ["X1", "X2", "X3", "X4"]:
    InfoXD[cond] = conditional_entropy(emp_p, ["Y"], [cond])

Math(
    r"""
\begin{{aligned}}
\text{{Info}}_{{X_1}}(D)&={X1:.3g}\\
\text{{Info}}_{{X_2}}(D)&={X2:.3g}\\
\text{{Info}}_{{X_3}}(D)&={X3:.3g}\\
\text{{Info}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
        **InfoXD
    )
)
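The information gain from splitting on $X_i$ is the reduction in entropy of the target, namely the empirical mutual information:

\begin{align}
\text{Gain}_{X_i}(D) := \text{Info}(D) - \text{Info}_{X_i}(D) = H(Y) - H(Y|X_i) = I(X_i; Y).
\end{align}

Compute the gain for each input feature below, storing the values in GainXD: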
GainXD = {}
# YOUR CODE HERE
raise NotImplementedError()

Math(
    r"""
\begin{{aligned}}
\text{{Gain}}_{{X_1}}(D)&={X1:.3g}\\
\text{{Gain}}_{{X_2}}(D)&={X2:.3g}\\
\text{{Gain}}_{{X_3}}(D)&={X3:.3g}\\
\text{{Gain}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
        **GainXD
    )
)
# tests
assert np.isclose(GainXD["X1"], 0.5, rtol=1e-3)
assert np.isclose(GainXD["X2"], 1, rtol=1e-3)

YOUR ANSWER HERE

To avoid the bias towards attributes with many outcomes, C4.5/J48 normalizes the information gain by $\text{SplitInfo}_{X_i}(D)$, which can be calculated as $H(X_i)$:

SplitInfoXD = {}
for cond in ["X1", "X2", "X3", "X4"]:
    SplitInfoXD[cond] = entropy(emp_p, [cond])

Math(
    r"""
\begin{{aligned}}
\text{{SplitInfo}}_{{X_1}}(D)&={X1:.3g}\\
\text{{SplitInfo}}_{{X_2}}(D)&={X2:.3g}\\
\text{{SplitInfo}}_{{X_3}}(D)&={X3:.3g}\\
\text{{SplitInfo}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
        **SplitInfoXD
    )
)
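C4.5/J48 then prefers the attribute with the largest gain ratio

\begin{align}
\frac{\text{Gain}_{X_i}(D)}{\text{SplitInfo}_{X_i}(D)} = \frac{H(Y) - H(Y|X_i)}{H(X_i)}.
\end{align}

Compute the gain ratio for each input feature below, storing the values in GainRatioXD: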
GainRatioXD = {}
# YOUR CODE HERE
raise NotImplementedError()

Math(
    r"""
\begin{{aligned}}
\frac{{\text{{Gain}}_{{X_1}}(D)}}{{\text{{SplitInfo}}_{{X_1}}(D)}}&={X1:.3g}\\
\frac{{\text{{Gain}}_{{X_2}}(D)}}{{\text{{SplitInfo}}_{{X_2}}(D)}}&={X2:.3g}\\
\frac{{\text{{Gain}}_{{X_3}}(D)}}{{\text{{SplitInfo}}_{{X_3}}(D)}}&={X3:.3g}\\
\frac{{\text{{Gain}}_{{X_4}}(D)}}{{\text{{SplitInfo}}_{{X_4}}(D)}}&={X4:.3g}\\
\end{{aligned}}
""".format(
        **GainRatioXD
    )
)
# tests
assert np.isclose(GainRatioXD["X1"], 0.5, rtol=1e-3)
assert np.isclose(GainRatioXD["X2"], 1, rtol=1e-3)

YOUR ANSWER HERE