Problem Formulation

\(\def\abs#1{\left\lvert #1 \right\rvert} \def\Set#1{\left\{ #1 \right\}} \def\mc#1{\mathcal{#1}} \def\M#1{\boldsymbol{#1}} \def\R#1{\mathsf{#1}} \def\RM#1{\boldsymbol{\mathsf{#1}}} \def\op#1{\operatorname{#1}} \def\E{\op{E}} \def\d{\mathrm{\mathstrut d}}\)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
SEED = 0

Mutual information estimation

How to formulate the problem of mutual information estimation?

The problem of estimating the mutual information is:

Definition 1 (MI Estimation)

Given \(n\) samples

\[(\R{X}_1,\R{Y}_1),\dots, (\R{X}_n,\R{Y}_n) \stackrel{iid}{\sim} P_{\R{X},\R{Y}}\in \mc{P}(\mc{X},\mc{Y})\]

drawn i.i.d. from an unknown probability measure \(P_{\R{X},\R{Y}}\) on the product space \(\mc{X}\times \mc{Y}\), estimate the mutual information (MI)

(1)\[ \begin{align} I(\R{X}\wedge\R{Y}) &:= E\left[\log \frac{d P_{\R{X},\R{Y}}(\R{X},\R{Y})}{d (P_{\R{X}} \times P_{\R{Y}})(\R{X},\R{Y})} \right]. \end{align} \]
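To make the density ratio in (1) concrete, here is a minimal Monte Carlo sketch for a bivariate Gaussian whose correlation coefficient is fixed to a known value purely for illustration (unlike the unknown \(\rho\) used below); it assumes scipy is available for evaluating the densities, and compares the sample average against the closed form \(-\frac12\log(1-\rho^2)\) for the Gaussian MI.

# Monte Carlo illustration of (1) with a KNOWN correlation, for illustration only
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
rho_demo = 0.5  # known correlation coefficient
cov_demo = [[1, rho_demo], [rho_demo, 1]]
samples = rng.multivariate_normal([0, 0], cov_demo, 100000)

p_joint = multivariate_normal(mean=[0, 0], cov=cov_demo).pdf(samples)  # dP_{X,Y}/dxdy
p_product = norm.pdf(samples[:, 0]) * norm.pdf(samples[:, 1])          # d(P_X x P_Y)/dxdy

mi_mc = np.mean(np.log(p_joint / p_product))  # sample average approximating (1)
mi_exact = -0.5 * np.log(1 - rho_demo**2)     # closed form for bivariate Gaussians
mi_mc, mi_exact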

Run the following code, which uses numpy to

  • generate i.i.d. samples from a multivariate Gaussian distribution, and

  • store the samples as a numpy array assigned to XY.

# Seeded random number generator for reproducibility
XY_rng = np.random.default_rng(SEED)

# Sampling from an unknown probability measure
rho = 0.8 + 0.19 * XY_rng.random()
mean, cov, n = [0, 0], [[1, rho], [rho, 1]], 1000
XY = XY_rng.multivariate_normal(mean, cov, n)
plt.scatter(XY[:, 0], XY[:, 1], s=2)
plt.show()

See multivariate_normal and scatter.

You can also get help directly in JupyterLab:

  • Docstring:

    • Move the cursor to the object and

      • click Help->Show Contextual Help or

      • press Shift-Tab if you have limited screen space.

  • Directory:

    • Right click on a notebook and choose New Console for Notebook.

    • Run dir(obj) for a previously defined object obj to see the available methods/properties of obj.

Exercise

What is unknown about the above sampling distribution?

Solution

The density is

\[\begin{split} \frac{d P_{\R{X},\R{Y}}}{dxdy} = \mc{N}_{\M{0},\left[\begin{smallmatrix}1 & \rho \\ \rho & 1\end{smallmatrix}\right]}(x,y) \end{split}\]

but \(\rho\) is unknown (uniformly random over \([0.8,0.99)\)).
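Since the sampling distribution is bivariate Gaussian, the MI in (1) has the closed form \(-\frac12\log(1-\rho^2)\) (in nats). Continuing from the sampling cell above, and only because we generated the data ourselves, we can peek at the realized \(\rho\) to obtain the ground-truth value that an estimator should recover:

# Ground truth, available only because we simulated the data ourselves:
# for a bivariate Gaussian with correlation rho, I(X ∧ Y) = -0.5 * log(1 - rho**2) nats.
ground_truth_MI = -0.5 * np.log(1 - rho**2)
ground_truth_MI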

To show the data samples using pandas:

XY_df = pd.DataFrame(XY, columns=["X", "Y"])
XY_df

To plot the data using seaborn:

def plot_samples_with_kde(df, **kwargs):
    p = sns.PairGrid(df, **kwargs)
    p.map_lower(sns.scatterplot, s=2)  # scatter plot of samples
    p.map_upper(sns.kdeplot)  # kernel density estimate for pXY
    p.map_diag(sns.kdeplot)  # kde for pX and pY
    return p


plot_samples_with_kde(XY_df)
plt.show()

Exercise

Complete the following code by replacing the blanks ___ so that XY_ref stores \(n\) i.i.d. samples of \((\R{X}',\R{Y}')\), where \(\R{X}'\) and \(\R{Y}'\) are independent zero-mean Gaussian random variables with unit variance.

...
cov_ref, n_ = ___, n
XY_ref = XY_ref_rng.___(mean, ___, n_)
...
XY_ref_rng = np.random.default_rng(SEED)
### BEGIN SOLUTION
cov_ref, n_ = [[1, 0], [0, 1]], n
XY_ref = XY_ref_rng.multivariate_normal(mean, cov_ref, n_)
### END SOLUTION
XY_ref_df = pd.DataFrame(XY_ref, columns=["X'", "Y'"])
plot_samples_with_kde(XY_ref_df)
plt.show()

Divergence estimation

Can we generalize the problem further?

Estimating MI may be viewed as a special case of the following problem:

Definition 2 (Divergence estimation)

Estimate the KL divergence

(2)\[ \begin{align} D(P_{\R{Z}}\|P_{\R{Z}'}) &:= E\left[\log \frac{d P_{\R{Z}}(\R{Z})}{d P_{\R{Z}'}(\R{Z})} \right] \end{align} \]

using

  • a sequence \(\R{Z}^n:=(\R{Z}_1,\dots, \R{Z}_n)\sim P_{\R{Z}}^n\) of i.i.d. samples from \(P_{\R{Z}}\) if \(P_{\R{Z}}\) is unknown, and

  • another sequence \({\R{Z}'}^{n'}\sim P_{\R{Z}'}^{n'}\) of i.i.d. samples from \(P_{\R{Z}'}\) if \(P_{\R{Z}'}\), the reference measure of \(P_{\R{Z}}\), is also unknown.
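For instance, MI estimation is the special case with \(\R{Z}:=(\R{X},\R{Y})\sim P_{\R{X},\R{Y}}\) and \(\R{Z}'\sim P_{\R{X}}\times P_{\R{Y}}\). The following sketch sets up the two sample sequences by reusing the XY and XY_ref arrays generated earlier (an assumption on the notebook state); the reference sequence is truncated only to emphasize that \(n'\) may differ from \(n\):

# MI estimation cast as divergence estimation (reusing earlier cells):
#   Z  := (X, Y)   ~ P_{X,Y}        samples: XY      (length n)
#   Z' := (X', Y') ~ P_X x P_Y      samples: XY_ref  (length n', may differ from n)
Z = XY
Z_ref = XY_ref[: n // 2]  # n' = n/2, just to show that the lengths need not match
Z.shape, Z_ref.shape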

Exercise

Although \(\R{X}^n\) and \(\R{Y}^n\) for MI estimation must have the same length, \(\R{Z}^n\) and \({\R{Z}'}^{n'}\) can have different lengths, i.e., \(n\) need not equal \(n'\). Why?

Solution

The divergence in (2) depends only on the marginal distributions \(P_{\R{Z}}\) and \(P_{\R{Z}'}\); any dependency between \(\R{Z}\) and \(\R{Z}'\) does not affect it. The two sample sequences therefore need not be paired and can have different lengths. In contrast, MI estimation requires the samples \((\R{X}_i,\R{Y}_i)\) to be drawn jointly from \(P_{\R{X},\R{Y}}\), so \(\R{X}^n\) and \(\R{Y}^n\) come in pairs of the same length.

Regarding the mutual information as the divergence from the joint distribution to the product of the marginal distributions, the problem can be generalized further to estimating other divergences such as the \(f\)-divergence:

For a strictly convex function \(f\) with \(f(1)=0\),

(3)\[ \begin{align} D_f(P_{\R{Z}}\|P_{\R{Z}'}) &:= E\left[ f\left(\frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')}\right) \right]. \end{align} \]

The \(f\)-divergence in (3) reduces to the KL divergence when \(f(u)=u \log u\):

\[ \begin{align} E\left[ \frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')} \log \frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')} \right] &= \int_{\mc{Z}} \color{gray}{d P_{\R{Z}'}(z)} \cdot \frac{d P_{\R{Z}}(z)}{\color{gray}{d P_{\R{Z}'}(z)}} \log \frac{d P_{\R{Z}}(z)}{d P_{\R{Z}'}(z)} \\ &= \int_{\mc{Z}} d P_{\R{Z}}(z) \log \frac{d P_{\R{Z}}(z)}{d P_{\R{Z}'}(z)} = D(P_{\R{Z}}\|P_{\R{Z}'}). \end{align} \]
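The identity above can be checked numerically on a pair of discrete distributions, where both expectations are finite sums; a small sketch with hypothetical distributions p (for \(P_{\R{Z}}\)) and q (for \(P_{\R{Z}'}\)):

# Numerical check on discrete distributions:
# the expectation over Z'~q of (p/q) log(p/q) equals the expectation over Z~p of log(p/q),
# i.e., the KL divergence.
import numpy as np

p = np.array([0.2, 0.5, 0.3])  # hypothetical P_Z
q = np.array([0.4, 0.4, 0.2])  # hypothetical P_Z'

f_div = np.sum(q * (p / q) * np.log(p / q))  # f-divergence with f(u) = u log u
kl_div = np.sum(p * np.log(p / q))           # D(P_Z || P_Z')
f_div, kl_div                                # identical values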

Exercise

Show that \(D_f(P_{\R{Z}}\|P_{\R{Z}'})\geq 0\) with equality iff \(P_{\R{Z}}=P_{\R{Z}'}\) using Jensen’s inequality and the properties of \(f\).

Solution

It is a valid divergence because, by Jensen’s inequality,

\[ D_f(P_{\R{Z}}\|P_{\R{Z}'}) \geq f\bigg( \underbrace{E\left[ \frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')} \right]}_{=1}\bigg) = 0 \]

where the last equality uses \(f(1)=0\). Since \(f\) is strictly convex, Jensen's inequality holds with equality iff the density ratio \(\frac{d P_{\R{Z}}}{d P_{\R{Z}'}}\) is almost surely constant, i.e., equal to \(1\), which is precisely the case \(P_{\R{Z}}=P_{\R{Z}'}\).
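The same non-negativity can be observed numerically for arbitrary discrete distributions; a small sketch with randomly drawn \(P_{\R{Z}}\) and \(P_{\R{Z}'}\):

# Non-negativity of the f-divergence with f(u) = u log u on random discrete distributions
import numpy as np

rng = np.random.default_rng(1)
for _ in range(3):
    p = rng.dirichlet(np.ones(5))  # random P_Z
    q = rng.dirichlet(np.ones(5))  # random P_Z'
    print(np.sum(q * (p / q) * np.log(p / q)))  # always >= 0

p = q  # P_Z = P_Z'
print(np.sum(q * (p / q) * np.log(p / q)))      # = 0 when the distributions coincide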

Regarding the divergence in (3) as an expectation over \(\R{Z}'\), one may approximate it by the sample average over the reference samples:

(4)\[ \begin{align} D_f(P_{\R{Z}}\|P_{\R{Z}'}) &\approx \frac1{n'} \sum_{i\in [n']} f\left(\frac{d P_{\R{Z}}(\R{Z}'_i)}{d P_{\R{Z}'}(\R{Z}'_i)}\right). \end{align} \]

However, this is not a valid estimate because it involves the unknown measures \(P_{\R{Z}}\) and \(P_{\R{Z}'}\).
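Nonetheless, since the data in this notebook were simulated, we can evaluate (4) with the true densities as a sanity check, assuming scipy is available and the earlier cells defining rho and XY_ref have run; a real estimator has no access to these densities.

# Sanity check of (4) using the TRUE densities, known here only because we simulated the data:
# P_Z is the correlated bivariate Gaussian and P_Z' = P_X x P_Y.
from scipy.stats import multivariate_normal, norm

p_Z = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).pdf


def p_Z_ref(z):
    # product of the standard normal marginals, i.e., the density of P_X x P_Y
    return norm.pdf(z[:, 0]) * norm.pdf(z[:, 1])


ratio = p_Z(XY_ref) / p_Z_ref(XY_ref)      # density ratio evaluated at the Z' samples
estimate = np.mean(ratio * np.log(ratio))  # (4) with f(u) = u log u; noisy for moderate n'
estimate, -0.5 * np.log(1 - rho**2)        # compare with the ground-truth divergence (= MI)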

To obtain a valid estimate, one may first estimate the density ratio

(5)\[ \begin{align} z \mapsto \frac{d P_{\R{Z}}(z)}{d P_{\R{Z}'}(z)} \end{align} \]

or estimate the density defined with respect to some reference measure \(\mu\):

(6)\[ \begin{align} p_{\R{Z}}&:=\frac{dP_{\R{Z}}}{d\mu} \in \mc{P}_{\mu}(\mc{Z}). \end{align} \]
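For example, one naive route suggested by (5) and (6) is to estimate both densities with a kernel density estimator and plug the estimated ratio into the sample average (4). A rough sketch using scipy's gaussian_kde, again assuming the earlier XY and XY_ref cells have run; it only illustrates the plug-in idea and is not meant to be an accurate estimator.

# Naive plug-in sketch: estimate the densities of Z and Z' by kernel density estimation,
# then substitute the estimated density ratio into the sample average (4).
from scipy.stats import gaussian_kde

kde_Z = gaussian_kde(XY.T)          # density estimate for Z = (X, Y); expects shape (d, n)
kde_Z_ref = gaussian_kde(XY_ref.T)  # density estimate for Z' = (X', Y')

ratio_est = kde_Z(XY_ref.T) / kde_Z_ref(XY_ref.T)  # estimated density ratio at the Z' samples
np.mean(ratio_est * np.log(ratio_est))             # plug-in value of (4)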