Information Theory
- Entropy:
- Set-theoretic structure of Shannon's
information measure: I-measure
- Lattice-theoretic description of Shannon's
information measure
- Information geometry: the differential-geometric structure of the parameter space of probability distributions.
- Group-theoretic structure
- Graph-theoretic structure
- Capacity of a channel
- It is also called "information
capacity". It takes units of "bits per
transmission", or "bits per second", or
"bits/s/Hz".
- It is different from another capacity
called connection-carrying capacity, used in connection admission
control.
- Operational definition: the maximum achievable rate R for a (2^{nR}, n) code whose maximal (peak) error probability approaches zero as n goes to infinity.
- Information definition: max I(X;Y), where the maximization is over the input distribution p(x).
- Theorem (channel coding theorem): max achievable rate = max I(X;Y).
- Capacity of a discrete channel, in units of "bits per transmission".
- C = max I(X;Y), where X and Y are
input/output discrete random variables; I(X;Y) is mutual
information; and the maximization is over the Probability Mass
Function (PMF) of the input X.
- Binary Symmetric Channel (BSC) with crossover probability p: C = 1 - H(p), where H(·) is the binary entropy function.
- Binary Erasure Channel (BEC) with erasure probability p: C = 1 - p (see the sketch below).
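A minimal numerical sketch of the two formulas above (function names are mine, not from the notes):

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """C = 1 - H(p) for a BSC with crossover probability p."""
    return 1.0 - binary_entropy(p)

def bec_capacity(p: float) -> float:
    """C = 1 - p for a BEC with erasure probability p."""
    return 1.0 - p

print(bsc_capacity(0.11))   # ~0.50 bits per transmission
print(bec_capacity(0.5))    # 0.50 bits per transmission
```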
- Capacity of a continuous channel, in units of "bits per transmission".
- C = max I(X;Y), where X and Y are
input/output continuous random variables, and the maximization is
over the Probability Density Function (PDF) of the input X.
- Additive White Gaussian Noise (AWGN)
channel (continuous-valued discrete-time channel)
- C = (1/2)*log(1+SNR) for real-valued input
- C = log(1+SNR) for complex-valued input
- Capacity of a band-limited AWGN channel (continuous-valued continuous-time channel, or waveform channel), in units of "bits per second".
- Shannon's information capacity theorem (for a real-valued waveform channel with AWGN):
- C = B*log(1+SNR), where SNR = P/(N0*B), P is the signal power, N0 is the noise power spectral density, and B is the bandwidth (see the sketch below).
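A small sketch of the discrete-time and band-limited AWGN formulas above, assuming log base 2; the function names and the numerical example are my own:

```python
import math

def awgn_capacity_per_use(snr: float, complex_input: bool = False) -> float:
    """Bits per channel use: (1/2)*log2(1+SNR) for real input, log2(1+SNR) for complex input."""
    c = math.log2(1.0 + snr)
    return c if complex_input else 0.5 * c

def bandlimited_awgn_capacity(P: float, N0: float, B: float) -> float:
    """Shannon capacity C = B*log2(1 + P/(N0*B)) in bits per second."""
    return B * math.log2(1.0 + P / (N0 * B))

print(awgn_capacity_per_use(snr=15.0))                  # 2.0 bits per real channel use
# B = 1 MHz, P = 1e-6 W, N0 = 1e-14 W/Hz  =>  SNR = 100, C ~ 6.66 Mbit/s
print(bandlimited_awgn_capacity(P=1e-6, N0=1e-14, B=1e6))
```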
- Capacity of a discrete-time discrete-valued channel with memory (refer to Gallager's book, p. 97), or a finite-state channel (FSC).
- Capacity of a fading channel (a Monte Carlo sketch follows this item)
- Ergodic capacity for a discrete-time i.i.d. fading channel, in units of "bits per transmission".
- Outage capacity
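A Monte Carlo sketch of ergodic capacity and outage probability, under the illustrative assumption of Rayleigh fading with unit-mean channel power gain (the fading model is my choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def ergodic_capacity(avg_snr: float, n: int = 200_000) -> float:
    """E[log2(1 + avg_snr*|h|^2)] in bits per transmission; |h|^2 ~ Exp(1) for Rayleigh fading."""
    g = rng.exponential(1.0, n)
    return float(np.mean(np.log2(1.0 + avg_snr * g)))

def outage_probability(avg_snr: float, rate: float, n: int = 200_000) -> float:
    """P[log2(1 + avg_snr*|h|^2) < rate]; outage capacity is the largest rate meeting a target outage."""
    g = rng.exponential(1.0, n)
    return float(np.mean(np.log2(1.0 + avg_snr * g) < rate))

print(ergodic_capacity(10.0))             # below log2(1+10) ~ 3.46 by Jensen's inequality
print(outage_probability(10.0, rate=2.0))
```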
- Rate distortion function
- Operational definition: the minimum achievable rate R for a (2^{nR}, n) code whose expected distortion is at most D as n goes to infinity.
- Information definition: min I(X; \hat{X}) s.t. expected distortion <= D, where the minimization is over the conditional distribution p(\hat{x}|x) (a closed-form example for a Bernoulli source follows this item).
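For a concrete instance of the information definition: a Bernoulli(p) source under Hamming distortion has the closed form R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p) and R(D) = 0 otherwise. A minimal sketch (function names are mine):

```python
import math

def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p: float, D: float) -> float:
    """R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p), else 0 (Hamming distortion)."""
    if D >= min(p, 1.0 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

print(rate_distortion_bernoulli(0.5, 0.1))   # 1 - H(0.1) ~ 0.531 bits per symbol
```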
- Computational information theory:
- Blahut's iterative algorithm (also known as the Blahut-Arimoto algorithm) is used to calculate 1) the channel capacity and 2) the rate-distortion function (a sketch of the capacity version appears at the end of this block).
- The key idea of Blahut's algorithm is similar to that of the Lloyd algorithm:
- Obtain the KKT conditions for min I(X; \hat{X}) s.t. expected distortion <= D.
- Decompose the KKT conditions into two optimality conditions.
- Use the two optimality conditions to design the two computation steps of each iteration.
- Prove that the iterative algorithm converges to the optimal solution.
- Blahut's iterative algorithm is a general
optimization algorithm, suitable for nonlinear programming, integer
programming, and mixed integer programming, i.e.,
- If the optimization problem is a high-dimensional nonlinear programming problem, Blahut's algorithm is a 1-D iterative nonlinear programming algorithm. Note that we still need a 1-D nonlinear programming algorithm to solve each subproblem.
- If the optimization problem is a high-dimensional integer programming problem, Blahut's algorithm is a 1-D iterative integer programming algorithm. Note that we still need a 1-D integer programming algorithm to solve each subproblem.
- If the optimization problem is a high-dimensional mixed integer programming problem, Blahut's algorithm is a 1-D iterative mixed integer programming algorithm. Note that we still need a 1-D mixed integer programming algorithm to solve each subproblem.
- Proof of convergence of the two Blahut
algorithms
- Blahut's iterative algorithm is based on the principle of divide and conquer, i.e., divide the problem into smaller subproblems, solve them, and merge the solutions. The following algorithms use the same principle:
- Gauss-Seidel version of value iteration in dynamic programming:
- For an N-dimensional cost vector J with N initial states, in each iteration, calculate the new cost FJ for only one initial state (instead of computing the cost vector J for all the initial states in parallel).
- Gaussian elimination method for solving a system of N linear equations:
- In each of the N iterations, solve for one unknown variable.
- Gibbs sampler for generating N-dimensional random vectors (see the sketch below):
- Each iteration consists of N steps; in each step, generate a sample from the 1-D full conditional distribution of one component given all the others.
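A minimal Gibbs-sampler sketch for a zero-mean bivariate Gaussian with correlation rho; each step draws one coordinate from its 1-D full conditional (the target distribution is my illustrative choice):

```python
import numpy as np

def gibbs_bivariate_gaussian(rho: float, n_samples: int = 20_000, seed: int = 0) -> np.ndarray:
    """Gibbs sampler: x1 | x2 ~ N(rho*x2, 1-rho^2) and x2 | x1 ~ N(rho*x1, 1-rho^2)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    out = np.empty((n_samples, 2))
    s = np.sqrt(1.0 - rho * rho)            # conditional standard deviation
    for i in range(n_samples):
        x[0] = rng.normal(rho * x[1], s)    # sample x1 from its full conditional
        x[1] = rng.normal(rho * x[0], s)    # sample x2 from its full conditional
        out[i] = x
    return out

samples = gibbs_bivariate_gaussian(rho=0.8)
print(np.corrcoef(samples.T))               # empirical correlation close to 0.8
```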
- Lloyd algorithm (see the sketch below):
- Each iteration consists of two steps, i.e., 1) the nearest-neighbor condition and 2) the centroid condition.
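A 1-D Lloyd-algorithm sketch on training samples, alternating the two conditions above (the initialization and the Gaussian test data are my choices):

```python
import numpy as np

def lloyd_quantizer(samples: np.ndarray, n_levels: int, n_iter: int = 50) -> np.ndarray:
    """Alternate 1) nearest-neighbor assignment and 2) centroid update of the codebook."""
    codebook = np.quantile(samples, np.linspace(0.1, 0.9, n_levels))   # initial codewords
    for _ in range(n_iter):
        cells = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)  # step 1
        for j in range(n_levels):                                                # step 2
            if np.any(cells == j):
                codebook[j] = samples[cells == j].mean()
    return np.sort(codebook)

rng = np.random.default_rng(1)
print(lloyd_quantizer(rng.normal(size=50_000), n_levels=4))
# For a unit Gaussian, the 4-level Lloyd-Max levels are roughly [-1.51, -0.45, 0.45, 1.51].
```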
- EM algorithm:
- Each iteration consists of two steps, i.e., the E-step and the M-step (a Gaussian-mixture sketch follows the next item).
- Generalized EM algorithm:
- Each iteration consists of two steps, i.e., a generalized E-step (taking the expectation of the lower bound of the Q function) and a generalized M-step (maximizing the resulting expectation of the lower bound); see http://mplab.ucsd.edu/wordpress/tutorials/EM.pdf
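A compact EM sketch for a two-component 1-D Gaussian mixture, showing the E-step/M-step alternation (the model and initialization are my illustrative choices):

```python
import numpy as np

def em_gmm_1d(x: np.ndarray, n_iter: int = 100):
    """EM for a 2-component 1-D Gaussian mixture: the E-step computes posterior
    responsibilities, the M-step re-estimates weights, means, and variances."""
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])        # crude but workable initialization
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: r[i, k] = P(component k | x_i)
        lik = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2.0, 1.0, 3000), rng.normal(3.0, 0.5, 7000)])
print(em_gmm_1d(data))   # roughly w=[0.3, 0.7], mu=[-2, 3], var=[1, 0.25]
```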
- Belief propagation: the sum-product algorithm is a fast algorithm to compute the marginals p(X_i|Y_1^N).
- BCJR for turbo iterative decoding
- Viterbi algorithm
- FFT algorithm
- Simulation-based optimization
- Reinforcement learning
- Nonlinear system theory
- Automaton theory
- Limitations of Blahut's algorithm
- Its performance relies on a 1-D iterative nonlinear/integer/mixed-integer programming algorithm. Existing nonlinear programming algorithms converge to a locally optimal solution rather than a globally optimal one (except for convex/concave objective functions); hence Blahut's algorithm cannot guarantee a globally optimal solution for general nonlinear programming problems. For integer programming problems, Blahut's algorithm may incur exponential complexity.
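A minimal sketch of the Blahut-Arimoto iteration for the capacity of a discrete memoryless channel (the rate-distortion version follows the same alternating pattern); the variable names, the stopping rule based on the standard capacity bounds, and the BSC sanity check are my choices:

```python
import numpy as np

def blahut_arimoto_capacity(W: np.ndarray, tol: float = 1e-9, max_iter: int = 10_000):
    """Blahut-Arimoto iteration for the capacity of a discrete memoryless channel.
    W[x, y] = p(y|x) is the transition matrix; returns (capacity in bits, input PMF)."""
    n_x = W.shape[0]
    p = np.full(n_x, 1.0 / n_x)                    # start from the uniform input PMF
    lower = 0.0
    for _ in range(max_iter):
        q = p @ W                                  # output PMF q(y) = sum_x p(x) W[x, y]
        # D[x] = sum_y W[x, y] * log(W[x, y] / q(y)), with the convention 0*log 0 = 0
        ratio = np.where(W > 0, W / np.maximum(q, 1e-300), 1.0)
        D = np.sum(np.where(W > 0, W * np.log(ratio), 0.0), axis=1)
        c = np.exp(D)
        lower = np.log(p @ c)                      # lower bound on C (in nats)
        upper = np.log(np.max(c))                  # upper bound on C (in nats)
        p = p * c / (p @ c)                        # multiplicative update of the input PMF
        if upper - lower < tol:
            break
    return lower / np.log(2.0), p

# Sanity check on a BSC with crossover probability 0.11: capacity ~ 0.5 bit, uniform input.
W_bsc = np.array([[0.89, 0.11],
                  [0.11, 0.89]])
print(blahut_arimoto_capacity(W_bsc))
```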
- Inequalities:
- Jensen Inequality (see the numerical check below):
- For a convex function f(x), f(E[x]) <= E[f(x)].
- For a concave function f(x), f(E[x]) >= E[f(x)].
- Many inequalities involving expectations (e.g., Hölder's inequality, the Cauchy-Schwarz inequality) can be proved using Jensen's inequality.
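A quick numerical check of Jensen's inequality with an exponential sample of mean 2, a convex f(x) = x^2, and a concave f(x) = log(x) (the distribution is my illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=1_000_000)

# Convex f(x) = x^2:  f(E[x]) <= E[f(x)]   (here: ~4 <= ~8)
print(np.mean(x) ** 2, np.mean(x ** 2))

# Concave f(x) = log(x):  f(E[x]) >= E[f(x)]   (here: ~0.69 >= ~0.12)
print(np.log(np.mean(x)), np.mean(np.log(x)))
```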
- Hölder inequality:
- Cauchy-Schwarz inequality:
- Minkowski inequality:
- Lyapunov inequality:
- Kraft inequality (see the check below):
- Intuition: the sum of probabilities of disjoint sets in a sample space should not exceed 1.
\sum_{x \in X} D^{-l(x)} <= 1, where x is a letter, X is the alphabet, D is the size of the code alphabet (D = 2 for binary codes), and l(x) is the length of the codeword assigned to x.
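A tiny check of the Kraft sum for binary (D = 2) codeword lengths; the example codes are my own:

```python
def kraft_sum(lengths, D=2):
    """Kraft sum over codeword lengths; it is <= 1 for any uniquely decodable D-ary code."""
    return sum(D ** (-l) for l in lengths)

# The prefix code {0, 10, 110, 111} has lengths 1, 2, 3, 3:
print(kraft_sum([1, 2, 3, 3]))   # 0.5 + 0.25 + 0.125 + 0.125 = 1.0  (satisfies <= 1)
# Lengths 1, 1, 2 cannot belong to any uniquely decodable binary code:
print(kraft_sum([1, 1, 2]))      # 1.25 > 1
```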
- Fano inequality:
- exp(x)>1+x, for x>0. (Use Taylor
expansion to prove.)
- exp(-x)>1-x, for x<0.
- ln(1+x)<x, for x>0. (Let f(x)=ln(1+x)-x.
Take the derivative and get the proof.) Draw a figure to see
this. f(x)=x increases faster than g(x)=ln(1+x).
- ln(1+x)<x, for -1<x<0. (Note the
sign changes when taking the derivative.)
- ln(1-x)<-x, for 0<x<1.
- ln(1+K)<= \sum_{k=1}^K 1/k <=
1+ln(K).
- MDL theory for universal coding: focuses on the mathematical properties (stochastic complexity, parametric complexity, optimal performance bounds, etc.) rather than on actual encoding algorithms.
- Entropy has three properties:
- Non-negativity
- Monotonic increase with the number of equally likely elementary events
- Additivity
- The essence of the additivity of entropy is that partitioning increases entropy. Partitioning is like branching in a tree.
Key Techniques:
- Asymptotic Equipartition Property (AEP): it stems from ergodic theory and ergodic theorems such as the Weak Law of Large Numbers. Use an epsilon-N argument and the concept of typical sequences. AEP is a useful tool for proving theorems in information theory (a numerical illustration for an i.i.d. source follows this sub-list).
- Strong AEP
- Weak AEP
- AEP for iid processes
- AEP for ergodic processes (Shannon-McMillan-Breiman
Theorem)
- The Strong Law of Large Numbers is not directly applicable (as it is for i.i.d. processes), since the terms of the stochastic sequence p(X_i|X_0^{i-1}) are not i.i.d. Use the sandwich method (upper/lower bounds) to prove it.
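A numerical illustration of the AEP for an i.i.d. Bernoulli(0.2) source: the normalized log-probability of an observed sequence concentrates around the entropy as n grows (the parameters are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.2                                            # i.i.d. Bernoulli(p) source
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)     # entropy ~ 0.722 bits

for n in (10, 100, 10_000):
    x = rng.random(n) < p                          # one source sequence of length n
    log_p = np.sum(np.where(x, np.log2(p), np.log2(1 - p)))
    print(n, -log_p / n, H)                        # -1/n * log p(X^n) -> H(X)
```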
- Random coding argument: generate a codebook randomly. Due to the symmetry of the randomly generated codewords, there is no need to compute the error probability for each codeword; considering only one codeword is sufficient to obtain the (average) error probability.
- Capacity for channels with additive white Gaussian noise:
- c = log(1+SNR), which is a concave function of SNR. For low SNR, c is approximately equal to SNR (with log in nats); the approximation error grows as SNR grows, and c < SNR for any SNR > 0 (see the check below).
- Instantaneous channel capacity: c(t) = log(1+SNR(t)), where SNR(t) is the instantaneous SNR at time t, for a flat-fading channel.
- Instantaneous channel capacity density: c(t,f) = log(1+SNR(t,f)), where SNR(t,f) is the SNR at frequency f and time t, for a frequency-selective fading channel.
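A quick check of the low-SNR behavior above, with capacity measured in nats so that c = ln(1+SNR):

```python
import math

# c = ln(1 + SNR) in nats: c ~ SNR at low SNR, the gap grows with SNR, and c < SNR for SNR > 0.
for snr in (0.01, 0.1, 1.0, 10.0):
    c = math.log(1.0 + snr)
    print(f"SNR={snr:6.2f}  c={c:7.4f} nats  gap={snr - c:7.4f}")
```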
Information theory vs. queueing theory
- The problem setting for information theory is that 1) there is no queue in the system, and 2) the communication channel is noisy (error-prone). In contrast, the problem setting for queueing theory is that 1) there is a queue in the system, and 2) the communication channel is noise-free (error-free).
- The problem with which information theory is concerned is the ultimate bounds for source coding and channel coding. In contrast, the problem that queueing theory addresses is the queueing delay distribution or the buffer overflow probability (or loss probability).
- A constant arrival rate and a constant departure rate make queueing theory less useful, e.g., in constant-bit-rate wireless voice communication.
- sharing
- point-to-point communication; topology