# Problem Formulation

$$\def\abs#1{\left\lvert #1 \right\rvert} \def\Set#1{\left\{ #1 \right\}} \def\mc#1{\mathcal{#1}} \def\M#1{\boldsymbol{#1}} \def\R#1{\mathsf{#1}} \def\RM#1{\boldsymbol{\mathsf{#1}}} \def\op#1{\operatorname{#1}} \def\E{\op{E}} \def\d{\mathrm{\mathstrut d}}$$

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
SEED = 0

## Mutual information estimation

How can we formulate the problem of mutual information estimation?

The problem of estimating the mutual information can be stated as follows:

Definition 1 (MI Estimation)

Given $$n$$ samples

$(\R{X}_1,\R{Y}_1),\dots, (\R{X}_n,\R{Y}_n) \stackrel{iid}{\sim} P_{\R{X},\R{Y}}\in \mc{P}(\mc{X},\mc{Y})$

drawn i.i.d. from an unknown probability measure $$P_{\R{X},\R{Y}}$$ on the product space $$\mc{X}\times \mc{Y}$$, estimate the mutual information (MI)

(1)\begin{align} I(\R{X}\wedge\R{Y}) &:= E\left[\log \frac{d P_{\R{X},\R{Y}}(\R{X},\R{Y})}{d (P_{\R{X}} \times P_{\R{Y}})(\R{X},\R{Y})} \right]. \end{align}

Run the following code, which uses numpy to

• generate i.i.d. samples from a multivariate Gaussian distribution, and

• store the samples as numpy arrays assigned to XY.

# Seeded random number generator for reproducibility
XY_rng = np.random.default_rng(SEED)

# Sampling from an unknown probability measure
rho = 1 - 0.19 * XY_rng.random()
mean, cov, n = [0, 0], [[1, rho], [rho, 1]], 1000
XY = XY_rng.multivariate_normal(mean, cov, n)
plt.scatter(XY[:, 0], XY[:, 1], s=2)
plt.show()

See multivariate_normal and scatter.

You can also get help directly in JupyterLab:

• Docstring:

  • Move the cursor to the object and

    • click Help->Show Contextual Help, or

    • press Shift-Tab if you have limited screen space.

• Directory:

  • Right-click on a notebook and choose New Console for Notebook.

  • Run dir(obj) for a previously defined object obj to see the available methods/properties of obj.

Exercise

What is unknown about the above sampling distribution?

Solution

The density is

$\begin{split} \frac{d P_{\R{X},\R{Y}}}{dx\,dy} = \mc{N}_{\M{0},\left[\begin{smallmatrix}1 & \rho \\ \rho & 1\end{smallmatrix}\right]}(x,y), \end{split}$

but $$\rho$$ is unknown: since XY_rng.random() is uniform over $$[0,1)$$, the correlation rho = 1 - 0.19 * XY_rng.random() is uniformly distributed over $$(0.81, 1]$$.
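Since the solution reveals the distribution family, the ground-truth MI is available in closed form: for a bivariate Gaussian with unit variances and correlation $$\rho$$, $$I(\R{X}\wedge\R{Y}) = -\frac12\log(1-\rho^2)$$. A minimal sketch that recovers the hidden $$\rho$$ by reusing the same SEED and evaluates the ground truth (useful later for judging estimates):

```python
import numpy as np

SEED = 0
rng = np.random.default_rng(SEED)  # same seed as XY_rng above
rho = 1 - 0.19 * rng.random()      # reproduces the hidden correlation

# Closed-form MI (in nats) of a bivariate Gaussian with correlation rho
ground_truth_mi = -0.5 * np.log(1 - rho**2)
print(rho, ground_truth_mi)
```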

To show the data samples using pandas:

XY_df = pd.DataFrame(XY, columns=["X", "Y"])
XY_df

To plot the data using seaborn:

def plot_samples_with_kde(df, **kwargs):
    p = sns.PairGrid(df, **kwargs)
    p.map_lower(sns.scatterplot, s=2)  # scatter plot of samples
    p.map_upper(sns.kdeplot)  # kernel density estimate for pXY
    p.map_diag(sns.kdeplot)  # kde for pX and pY
    return p

plot_samples_with_kde(XY_df)
plt.show()

Exercise

Complete the following code by replacing the blanks ___ so that XY_ref stores the i.i.d. samples of $$(\R{X}',\R{Y}')$$ where $$\R{X}'$$ and $$\R{Y}'$$ are zero-mean independent gaussian random variables with unit variance.

...
cov_ref, n_ = ___, n
XY_ref = XY_ref_rng.___(mean, ___, n_)
...

XY_ref_rng = np.random.default_rng(SEED)
### BEGIN SOLUTION
cov_ref, n_ = [[1, 0], [0, 1]], n
XY_ref = XY_ref_rng.multivariate_normal(mean, cov_ref, n_)
### END SOLUTION
XY_ref_df = pd.DataFrame(XY_ref, columns=["X'", "Y'"])
plot_samples_with_kde(XY_ref_df)
plt.show()

## Divergence estimation

Can we generalize the problem further?

Estimating MI may be viewed as a special case of the following problem:

Definition 2 (Divergence estimation)

Estimate the KL divergence

(2)\begin{align} D(P_{\R{Z}}\|P_{\R{Z}'}) &:= E\left[\log \frac{d P_{\R{Z}}(\R{Z})}{d P_{\R{Z}'}(\R{Z})} \right] \end{align}

using

• a sequence $$\R{Z}^n:=(\R{Z}_1,\dots, \R{Z}_n)\sim P_{\R{Z}}^n$$ of i.i.d. samples from $$P_{\R{Z}}$$ if $$P_{\R{Z}}$$ is unknown, and

• another sequence $${\R{Z}'}^{n'}\sim P_{\R{Z}'}^{n'}$$ of i.i.d. samples from $$P_{\R{Z}'}$$ if $$P_{\R{Z}'}$$, the reference measure of $$P_{\R{Z}}$$, is also unknown.
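For the running Gaussian example, the divergence view can be sanity-checked in closed form: taking $$\R{Z}:=(\R{X},\R{Y})$$ with $$P_{\R{Z}}=\mc{N}(\M{0},\Sigma)$$ and reference $$P_{\R{Z}'}=P_{\R{X}}\times P_{\R{Y}}=\mc{N}(\M{0},\M{I})$$, the standard zero-mean Gaussian KL formula $$\frac12(\op{tr}\Sigma - d - \log\det\Sigma)$$ recovers the MI. A sketch with an illustrative $$\rho$$ (the agreement, not the particular value, is the point):

```python
import numpy as np

rho = 0.9  # illustrative; the hidden rho is unknown in the actual problem
Sigma = np.array([[1, rho], [rho, 1]])
d = Sigma.shape[0]

# D(N(0, Sigma) || N(0, I)) = (tr(Sigma) - d - log det(Sigma)) / 2
kl = 0.5 * (np.trace(Sigma) - d - np.log(np.linalg.det(Sigma)))

# Closed-form MI of the bivariate Gaussian: -log(1 - rho^2) / 2
mi = -0.5 * np.log(1 - rho**2)
print(kl, mi)  # the two coincide
```

Here $$\op{tr}\Sigma = d = 2$$, so the trace terms cancel and only $$-\frac12\log\det\Sigma = -\frac12\log(1-\rho^2)$$ remains.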

Exercise

Although $$\R{X}^n$$ and $$\R{Y}^n$$ for MI estimation must have the same length, $$\R{Z}^n$$ and $${\R{Z}'}^{n'}$$ may have different lengths, i.e., $$n$$ need not equal $$n'$$. Why?

Solution

Unlike the paired samples $$(\R{X}_i,\R{Y}_i)$$ in MI estimation, the samples of $$\R{Z}$$ and $$\R{Z}'$$ need not be drawn jointly: the divergence depends only on the marginal distributions $$P_{\R{Z}}$$ and $$P_{\R{Z}'}$$, so any dependency between $$\R{Z}$$ and $$\R{Z}'$$, and hence any pairing of their samples, does not affect it.

Regarding the mutual information as a divergence of the joint distribution from the product of the marginals, the problem can be generalized further to estimate other divergences such as the $$f$$-divergence:

For a strictly convex function $$f$$ with $$f(1)=0$$,

(3)\begin{align} D_f(P_{\R{Z}}\|P_{\R{Z}'}) &:= E\left[ f\left(\frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')}\right) \right]. \end{align}

The $$f$$-divergence in (3) reduces to the KL divergence when $$f(u)=u \log u$$:

\begin{align} E\left[ \frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')} \log \frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')} \right] &= \int_{\mc{Z}} \color{gray}{d P_{\R{Z}'}(z)} \cdot \frac{d P_{\R{Z}}(z)}{\color{gray}{d P_{\R{Z}'}(z)}} \log \frac{d P_{\R{Z}}(z)}{d P_{\R{Z}'}(z)} = D(P_{\R{Z}}\|P_{\R{Z}'}). \end{align}
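This reduction can be checked numerically for a pair of known 1-D Gaussians, where the density ratio is available in closed form: with $$P_{\R{Z}} = \mc{N}(\mu, 1)$$ and $$P_{\R{Z}'} = \mc{N}(0, 1)$$, $$D(P_{\R{Z}}\|P_{\R{Z}'}) = \mu^2/2$$. A Monte Carlo sketch of (3) with $$f(u)=u\log u$$, averaging over reference samples (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 100_000, 1.0

def gauss_pdf(z, mean=0.0):
    """Density of a Gaussian with the given mean and unit variance."""
    return np.exp(-0.5 * (z - mean) ** 2) / np.sqrt(2 * np.pi)

z_ref = rng.normal(0, 1, n)                  # samples from the reference P_Z'
r = gauss_pdf(z_ref, mu) / gauss_pdf(z_ref)  # density ratio dP_Z/dP_Z'

est = np.mean(r * np.log(r))  # f-divergence form with f(u) = u log u
print(est, mu**2 / 2)         # Monte Carlo estimate vs closed-form KL
```

The estimate averages $$f$$ of the ratio over samples of $$\R{Z}'$$, exactly as in (3), yet converges to the KL divergence, whose defining expectation is over $$\R{Z}$$.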

Exercise

Show that $$D_f(P_{\R{Z}}\|P_{\R{Z}'})\geq 0$$ with equality iff $$P_{\R{Z}}=P_{\R{Z}'}$$ using Jensen’s inequality and the properties of $$f$$.

Solution

It is a valid divergence because, by Jensen's inequality and $$f(1)=0$$,

$D_f(P_{\R{Z}}\|P_{\R{Z}'}) \geq f\bigg( \underbrace{E\left[ \frac{d P_{\R{Z}}(\R{Z}')}{d P_{\R{Z}'}(\R{Z}')} \right]}_{=1}\bigg) = f(1) = 0$

with equality iff the density ratio is almost surely constant, i.e., $$P_{\R{Z}}=P_{\R{Z}'}$$, by the strict convexity of $$f$$.

Regarding the divergence as an expectation, one may hope to approximate it by the sample average over the $$n'$$ reference samples:

(4)\begin{align} D_f(P_{\R{Z}}\|P_{\R{Z}'}) &\approx \frac1{n'} \sum_{i\in [n']} f\left(\frac{d P_{\R{Z}}(\R{Z}'_i)}{d P_{\R{Z}'}(\R{Z}'_i)}\right). \end{align}

However, this is not a valid estimate because it involves the unknown measures $$P_{\R{Z}}$$ and $$P_{\R{Z}'}$$.

One may further estimate the density ratio

(5)\begin{align} z \mapsto \frac{d P_{\R{Z}}(z)}{d P_{\R{Z}'}(z)} \end{align}

or estimate the density defined with respect to some reference measure $$\mu$$:

(6)\begin{align} p_{\R{Z}}&:=\frac{dP_{\R{Z}}}{d\mu} \in \mc{P}_{\mu}(\mc{Z}). \end{align}
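As an illustration of the density-based route in (6), one could estimate $$p_{\R{Z}}$$ and $$p_{\R{Z}'}$$ (with $$\mu$$ the Lebesgue measure) by kernel density estimation and plug the estimated ratio into a sample average. This is only a sketch, not a recommended estimator: KDE plug-in estimates are biased, and `gaussian_kde`'s default bandwidth is assumed adequate here.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n, rho = 2000, 0.9  # illustrative sample size and correlation

# Samples from P_Z (correlated Gaussian) and the reference P_Z' (independent)
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n)
z_ref = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], n)

# Plug-in estimate: fit a KDE to each sample, then average the
# log density ratio over the samples of Z (the KL form of the divergence)
p_z = gaussian_kde(z.T)        # gaussian_kde expects shape (dim, n)
p_ref = gaussian_kde(z_ref.T)
est = np.mean(np.log(p_z(z.T) / p_ref(z.T)))
print(est, -0.5 * np.log(1 - rho**2))  # plug-in estimate vs ground truth
```

The gap between the printed values hints at why later sections pursue estimators that avoid explicit density estimation.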