Information Quantities for Decision Tree Induction

Chung Chan
City University of Hong Kong
import dit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dit.shannon import conditional_entropy, entropy, mutual_information
from IPython.display import Math, display
%matplotlib widget
In this notebook, we will use the dit package to compute the fundamental information quantities used in decision tree induction. More basic implementations and interpretations are given here.
Consider any $(X, Y)$ taking values from $\mathcal{X}\times \mathcal{Y}$ according to some joint probability measure $P_{X,Y}$.
Define
$$
\begin{align}
H(Y) &:= E\left[\log \tfrac1{p_{Y}(Y)}\right] && \text{called the entropy of $Y$,}\\
H(Y|X) &:= E\left[\log \tfrac1{p_{Y|X}(Y|X)}\right] && \text{called the conditional entropy of $Y$ given $X$,}
\end{align}
$$

where $p_{Y|X}$ and $p_Y$ are the probability density functions of $Y$ with and without conditioning on $X$, respectively. The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as the entropy of the random tuple $(X,Y)$ using the joint density $p_{X,Y}$.
Unless otherwise specified, all logarithms are base 2, in which case the information quantities are in units of bits (binary digits). A popular alternative is the natural logarithm, $\log = \ln$, in which case the unit is the nat. To make sense of the expectations in the definitions above, it is important to see that $p_Y(Y)$ and $p_{Y|X}(Y|X)$ are random because their arguments, namely $X$ and $Y$, are random, even though $p_Y$ and $p_{Y|X}$ are deterministic functions. For a discrete random variable $Y$, $p_Y$ and $p_{Y|X}$ should be the probability mass functions:
$$
\begin{align}
H(Y) &= \sum_{y\in \mathcal{Y}:p_Y(y)>0} p_Y(y) \log \frac1{p_Y(y)}\\
H(Y|X) &= E\left[\sum_{y\in \mathcal{Y}:p_{Y|X}(y|X)>0} p_{Y|X}(y|X) \log \frac1{p_{Y|X}(y|X)}\right]
\end{align}
$$

where $\mathcal{Y}$ is the alphabet of $Y$. The last expectation is over the randomness of $X$, which need not be discrete.
It is convenient to express the entropy as a function of the probability masses:

$$
h(p) = h(p_1, p_2, \dots) := \sum_{k:p_k>0} p_k \log \frac1{p_k}
$$

where $p_k\in [0,1]$ for all $k$, $\sum_k p_k=1$, and $p:i\mapsto p_i$.
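As a quick sanity check, here is a minimal NumPy implementation of $h$ (an illustration only; it is not part of the dit API):
import numpy as np


def h(p):
    """Entropy in bits of a probability vector, using the convention 0 log(1/0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero masses
    return float(np.sum(p * np.log2(1 / p)))


h([1 / 2, 0, 1 / 4, 1 / 4])  # 1.5 bits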
Rewriting the entropies in (2) in terms of $h$, we have

$$
\begin{align}
H(Y) &= h(p_Y)\\
H(Y|X) &= E[h(p_{Y|X}(\cdot|X))].
\end{align}
$$

For discrete random variables $Y$, explain whether $H(Y)$ is non-negative and scale-invariant in the sense that

$$
0\leq H(Y) = H(c Y) \quad \forall c\neq 0,
$$

using the above formulae in (4).
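As a numerical illustration only (the question above asks for an explanation), note that $cY$ has the same probability masses as $Y$, just attached to relabelled outcomes, so $h$ returns the same value:
import numpy as np


def h(p):
    """Entropy (bits) of a list of probability masses, ignoring zeros."""
    return sum(pk * np.log2(1 / pk) for pk in p if pk > 0)


pmf_Y = [3 / 4, 1 / 4]   # e.g. P(Y=1)=3/4, P(Y=2)=1/4
pmf_2Y = [3 / 4, 1 / 4]  # P(2Y=2)=3/4, P(2Y=4)=1/4: same masses, relabelled outcomes
h(pmf_Y), h(pmf_2Y)      # both ≈ 0.811 bits and non-negative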
To define and plot the distribution $(p_{00}, p_{01}, p_{10}, p_{11}) = \left(\frac12, 0, \frac14, \frac14\right)$ used in the examples below:
p = dit.Distribution(["00", "01", "10", "11"], [1 / 2, 0, 1 / 4, 1 / 4])
plt.stem(p.outcomes, p.pmf)
plt.xlabel("k")
plt.ylabel(r"$p_k$")
plt.ylim((0, 1))
plt.show()
To compute the entropy of the distribution:
Math("h(p_{00}, p_{01}, p_{10}, p_{11})=" + f"{entropy(p):.3g}")
The mutual information between $X$ and $Y$ is defined as

$$
\begin{align}
I(X;Y)
&:= E\left[\log \tfrac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right]\\
&= E\left[\log \tfrac{p_{Y|X}(Y|X)}{p_Y(Y)}\right]\\
&= E\left[\log \tfrac{p_{X|Y}(X|Y)}{p_X(X)}\right],
\end{align}
$$

where $p_{X,Y}$ and $p_X p_Y$ are the joint density and the product of the marginal density functions of $X$ and $Y$, respectively. The conditional mutual information of $X$ and $Y$ given $Z$ is
$$
\begin{align}
I(X;Y|Z)
&:= E\left[\log \tfrac{p_{X,Y|Z}(X,Y|Z)}{p_{X|Z}(X|Z)p_{Y|Z}(Y|Z)}\right],
\end{align}
$$

which has the additional conditioning on $Z$.
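To make these definitions concrete, the following sketch evaluates the unconditional mutual information (8) directly as a sum over a joint pmf; the numbers use the example distribution that appears later in this notebook:
import numpy as np

# Joint pmf of the example below: p_{XY}(0,0)=1/2, p_{XY}(1,0)=p_{XY}(1,1)=1/4.
p_xy = np.array([[1 / 2, 0], [1 / 4, 1 / 4]])
p_x = p_xy.sum(axis=1)  # marginal pmf of X
p_y = p_xy.sum(axis=0)  # marginal pmf of Y

# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), skipping zero masses
I_XY = sum(
    p_xy[x, y] * np.log2(p_xy[x, y] / (p_x[x] * p_y[y]))
    for x in range(2)
    for y in range(2)
    if p_xy[x, y] > 0
)
I_XY  # ≈ 0.311 bits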
All the information quantities defined so far can be concisely related using the Venn diagram shown in Figure 1.
Figure 1: Venn diagram interpretation of information measures
The mutual information (8) can be expressed in terms of the entropies as

$$
\begin{align}
I(X;Y)&=H(Y)-H(Y|X)\\
&=H(X)+H(Y)-H(X,Y)\\
&=H(X)-H(X|Y).
\end{align}
$$

The entropy satisfies
$$
\begin{align}
H(X,Y)&=H(X)+H(Y|X)\\
&=H(Y)+H(X|Y),
\end{align}
$$

which is called the chain rule of entropy.
The conditional mutual information (9) satisfies

$$
\begin{align}
I(X;Y|Z)&=I(X,Z;Y) - I(Y;Z) \\
&= I(X;Y,Z) - I(X;Z),
\end{align}
$$

which is called the chain rule of mutual information.
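The chain rule of mutual information can be checked numerically with dit. The sketch below uses a small three-variable distribution chosen purely for illustration (it is not used elsewhere in this notebook) and expresses $I(X;Y|Z)$ through joint entropies, so only the functions already imported above are needed:
import dit
from dit.shannon import entropy, mutual_information

# A small joint distribution over (X, Y, Z), chosen only for illustration:
# X and Y are independent fair bits and Z is their XOR.
d = dit.Distribution(["000", "011", "101", "110"], [1 / 4] * 4)
d.set_rv_names("XYZ")

# I(X;Y|Z) written in terms of joint entropies:
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_XY_given_Z = entropy(d, "XZ") + entropy(d, "YZ") - entropy(d, "XYZ") - entropy(d, "Z")

# Chain rule: I(X;Y|Z) = I(X,Z;Y) - I(Y;Z) = I(X;Y,Z) - I(X;Z)
rhs1 = mutual_information(d, ["X", "Z"], ["Y"]) - mutual_information(d, ["Y"], ["Z"])
rhs2 = mutual_information(d, ["X"], ["Y", "Z"]) - mutual_information(d, ["X"], ["Z"])

I_XY_given_Z, rhs1, rhs2  # all three equal 1 bit for this example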
Rewrite the distribution $p$ defined earlier in terms of the random variables $X$ and $Y$ as

$$
\begin{align}
p_{XY}(x,y)=\begin{cases}
\frac12 & (x,y)\in \{(0,0)\}\\
\frac14 & (x,y)\in \{(1,0), (1,1)\}\\
0 & \text{otherwise.}
\end{cases}
\end{align}
$$

To calculate the mutual information using the conditional entropy:
$$
\begin{align}
H(Y) &= h(p_Y(0), p_Y(1))\\
&= h(p_{00} + p_{10}, p_{01} + p_{11}) \\
&= h\left(\frac34, \frac14\right) \\
&\approx \underline{0.811}\\
H(Y|X) &= p_X(0)\, h(p_{Y|X}(0|0),p_{Y|X}(1|0)) + p_X(1)\, h(p_{Y|X}(0|1),p_{Y|X}(1|1))\\
&= (p_{00} + p_{01})\, h\left(\frac{p_{00}}{p_{00} + p_{01}}, \frac{p_{01}}{p_{00} + p_{01}} \right) + (p_{10} + p_{11})\, h\left(\frac{p_{10}}{p_{10} + p_{11}}, \frac{p_{11}}{p_{10} + p_{11}} \right)\\
&= \frac12 h\left(1, 0 \right) + \frac12 h\left(\frac12,\frac12 \right) \\
&= \underline{0.5} = 1.5-1 = H(X,Y) - H(X)\\
I(X;Y) &= H(Y)- H(Y|X) \\
&\approx 0.811 - 0.5\\
&= \underline{0.311}
\end{align}
$$

To verify the chain rule $H(X,Y)=H(X)+H(Y|X)$:
$$
\begin{align}
H(X, Y) - H(Y|X) &= h(p_{00}, p_{01}, p_{10}, p_{11}) - 0.5 \\
&= \underline{1} = h\left(\frac12,\frac12\right) = H(X).
\end{align}
$$

To define random variables in dit:
rv_names = "XY"
p.set_rv_names(rv_names)
p
We can now compute the entropies for different subsets of random variables:
for rvs in ["XY", "X", "Y", ""]:
    display(Math(f"H({','.join([rv for rv in rvs])})=" + f"{entropy(p, rvs):.3g}"))
To compute the conditional entropy:
Math("H(Y|X)=" + f"{conditional_entropy(p, 'Y', 'X'):.3g}")
To compute the mutual information:
Math("I(X;Y)=" + f"{mutual_information(p, 'X','Y'):.3g}")
# YOUR CODE HERE
raise NotImplementedError()
conditional_entropy_X_given_Y
# tests
assert (
    0
    <= conditional_entropy_X_given_Y
    <= min(entropy(p, "X"), entropy(p, "XY") - mutual_information(p, "X", "Y"))
)
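One possible way to complete the exercise above, assuming the goal is to compute $H(X|Y)$ (as suggested by the variable name and the test), is to mirror the earlier $H(Y|X)$ call:
# Possible solution sketch: H(X|Y) via dit, analogous to H(Y|X) above.
conditional_entropy_X_given_Y = conditional_entropy(p, "X", "Y")
Math("H(X|Y)=" + f"{conditional_entropy_X_given_Y:.3g}")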
Consider a dataset consisting of i.i.d. samples of some discrete random variables with an unknown joint distribution:
df = pd.read_csv("data.csv", dtype=str, skipinitialspace=True)
df
To estimate the information quantities of the features, we use the empirical distribution:
emp_p = dit.uniform([tuple(lst) for lst in df.to_numpy()])
emp_p.set_rv_names(df.columns)
emp_p
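For reference, the empirical distribution can also be constructed explicitly from the relative frequencies of the distinct rows (a sketch; it assumes a pandas version providing DataFrame.value_counts and should agree with emp_p above):
# Empirical pmf from relative frequencies of distinct rows (alternative construction).
counts = df.value_counts(normalize=True)  # Series indexed by tuples of row values
emp_p_alt = dit.Distribution(list(counts.index), list(counts.values))
emp_p_alt.set_rv_names(df.columns)
emp_p_alt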
How do we determine which attribute is more informative?
$\text{Info}(D)$ denotes the entropy $H(Y)$ of the empirical distribution of the target $Y$.
InfoD = entropy(emp_p, "Y")
Math(r"\text{Info}(D)=" + f"{InfoD:.3g}")
$\text{Info}_{X_i}(D)$ for each input feature $X_i$ denotes the conditional entropy $H(Y|X_i)$ of $Y$ given $X_i$.
InfoXD = {}
for cond in ["X1", "X2", "X3", "X4"]:
    InfoXD[cond] = conditional_entropy(emp_p, ["Y"], [cond])
Math(
r"""
\begin{{aligned}}
\text{{Info}}_{{X_1}}(D)&={X1:.3g}\\
\text{{Info}}_{{X_2}}(D)&={X2:.3g}\\
\text{{Info}}_{{X_3}}(D)&={X3:.3g}\\
\text{{Info}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
**InfoXD
)
)
GainXD = {}
# YOUR CODE HERE
raise NotImplementedError()
Math(
r"""
\begin{{aligned}}
\text{{Gain}}_{{X_1}}(D)&={X1:.3g}\\
\text{{Gain}}_{{X_2}}(D)&={X2:.3g}\\
\text{{Gain}}_{{X_3}}(D)&={X3:.3g}\\
\text{{Gain}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
**GainXD
)
)
# tests
assert np.isclose(GainXD["X1"], 0.5, rtol=1e-3)
assert np.isclose(GainXD["X2"], 1, rtol=1e-3)
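A sketch of one way to complete the cell above, assuming the usual definition of information gain, $\text{Gain}_{X_i}(D) := \text{Info}(D) - \text{Info}_{X_i}(D)$, which equals the mutual information $I(X_i;Y)$ of the empirical distribution:
# Possible solution sketch: information gain = Info(D) - Info_X(D) = I(X_i; Y).
GainXD = {cond: InfoD - InfoXD[cond] for cond in ["X1", "X2", "X3", "X4"]}
GainXD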
To avoid the bias towards attributes with many outcomes, C4.5/J48 normalizes the information gain by $\text{SplitInfo}_{X_i}(D)$, which can be calculated as $H(X_i)$:
SplitInfoXD = {}
for cond in ["X1", "X2", "X3", "X4"]:
    SplitInfoXD[cond] = entropy(emp_p, [cond])
Math(
r"""
\begin{{aligned}}
\text{{SplitInfo}}_{{X_1}}(D)&={X1:.3g}\\
\text{{SplitInfo}}_{{X_2}}(D)&={X2:.3g}\\
\text{{SplitInfo}}_{{X_3}}(D)&={X3:.3g}\\
\text{{SplitInfo}}_{{X_4}}(D)&={X4:.3g}\\
\end{{aligned}}
""".format(
**SplitInfoXD
)
)
GainRatioXD = {}
# YOUR CODE HERE
raise NotImplementedError()
Math(
r"""
\begin{{aligned}}
\frac{{\text{{Gain}}_{{X_1}}(D)}}{{\text{{SplitInfo}}_{{X_1}}(D)}}&={X1:.3g}\\
\frac{{\text{{Gain}}_{{X_2}}(D)}}{{\text{{SplitInfo}}_{{X_2}}(D)}}&={X2:.3g}\\
\frac{{\text{{Gain}}_{{X_3}}(D)}}{{\text{{SplitInfo}}_{{X_3}}(D)}}&={X3:.3g}\\
\frac{{\text{{Gain}}_{{X_4}}(D)}}{{\text{{SplitInfo}}_{{X_4}}(D)}}&={X4:.3g}\\
\end{{aligned}}
""".format(
**GainRatioXD
)
)
# tests
assert np.isclose(GainRatioXD["X1"], 0.5, rtol=1e-3)
assert np.isclose(GainRatioXD["X2"], 1, rtol=1e-3)
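Similarly, a sketch for the gain ratio used by C4.5/J48, assuming $\text{GainRatio}_{X_i}(D) := \text{Gain}_{X_i}(D)/\text{SplitInfo}_{X_i}(D)$:
# Possible solution sketch: normalize the information gain by the split information.
GainRatioXD = {
    cond: GainXD[cond] / SplitInfoXD[cond] for cond in ["X1", "X2", "X3", "X4"]
}
GainRatioXD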