Information Theory
- Entropy:
- Set-theoretic structure of Shannon's
information measure: I-measure
- Lattice-theoretic description of Shannon's
information measure
- Information geometry: the differential-geometric structure of the parameter space of probability distributions.
- Group-theoretic structure
- Graph-theoretic structure
- Capacity of a channel
- It is also called "information
capacity". It takes units of "bits per
transmission", or "bits per second", or
"bits/s/Hz".
- It is different from another capacity
called connection-carrying capacity, used in connection admission
control.
- Operational definition: the maximum achievable rate R for a (2^{nR}, n) code whose maximal (peak) error probability approaches zero as n goes to infinity.
- Information definition: max I(X;Y), where the maximization is over the input distribution p(x).
- Theorem (channel coding theorem): max achievable rate = max I(X;Y).
- Capacity of a discrete channel, in units of "bits per transmission".
- C = max I(X;Y), where X and Y are
input/output discrete random variables; I(X;Y) is mutual
information; and the maximization is over the Probability Mass
Function (PMF) of the input X.
- Binary Symmetric Channel (BSC) with crossover probability p: C = 1 - H(p), where H(·) is the binary entropy function.
- Binary Erasure Channel (BEC) with erasure probability p: C = 1 - p (see the sketch below).
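A minimal numerical sketch of the two formulas above (function names are mine, not from the notes):

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """C = 1 - H(p) for a BSC with crossover probability p."""
    return 1.0 - binary_entropy(p)

def bec_capacity(p: float) -> float:
    """C = 1 - p for a BEC with erasure probability p."""
    return 1.0 - p

print(bsc_capacity(0.11))   # ~0.50 bits per transmission
print(bec_capacity(0.5))    # 0.50 bits per transmission
```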
- Capacity of a continuous channel, in units of "bits per transmission".
- C = max I(X;Y), where X and Y are
input/output continuous random variables, and the maximization is
over the Probability Density Function (PDF) of the input X.
- Additive White Gaussian Noise (AWGN)
channel (continuous-valued discrete-time channel)
- C = (1/2)*log(1+SNR) for real-valued input
- C = log(1+SNR) for complex-valued input
- Capacity of a band-limited AWGN channel (continuous-valued continuous-time channel, or waveform channel), in units of "bits per second".
- Shannon's information capacity theorem (for a real-valued waveform channel with AWGN):
- C = B*log(1+SNR), where SNR = P/(N0*B), P is the signal power, N0 is the noise power spectral density, and B is the bandwidth (see the sketch below).
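A small sketch of the discrete-time and band-limited AWGN formulas above, assuming log base 2; the function names and the numerical example are my own:

```python
import math

def awgn_capacity_per_use(snr: float, complex_input: bool = False) -> float:
    """Bits per channel use: (1/2)*log2(1+SNR) for real input, log2(1+SNR) for complex input."""
    c = math.log2(1.0 + snr)
    return c if complex_input else 0.5 * c

def bandlimited_awgn_capacity(P: float, N0: float, B: float) -> float:
    """Shannon capacity C = B*log2(1 + P/(N0*B)) in bits per second."""
    return B * math.log2(1.0 + P / (N0 * B))

print(awgn_capacity_per_use(snr=15.0))                  # 2.0 bits per real channel use
# B = 1 MHz, P = 1e-6 W, N0 = 1e-14 W/Hz  =>  SNR = 100, C ~ 6.66 Mbit/s
print(bandlimited_awgn_capacity(P=1e-6, N0=1e-14, B=1e6))
```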
- Capacity of a discrete-time discrete-valued channel with memory (refer to Gallager's book, p. 97), or a finite-state channel (FSC).
- Capacity of a fading channel (a Monte Carlo sketch follows this item)
- Ergodic capacity for a discrete-time i.i.d. fading channel, in units of "bits per transmission".
- Outage capacity
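A Monte Carlo sketch of ergodic capacity and outage probability, under the illustrative assumption of Rayleigh fading with unit-mean channel power gain (the fading model is my choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def ergodic_capacity(avg_snr: float, n: int = 200_000) -> float:
    """E[log2(1 + avg_snr*|h|^2)] in bits per transmission; |h|^2 ~ Exp(1) for Rayleigh fading."""
    g = rng.exponential(1.0, n)
    return float(np.mean(np.log2(1.0 + avg_snr * g)))

def outage_probability(avg_snr: float, rate: float, n: int = 200_000) -> float:
    """P[log2(1 + avg_snr*|h|^2) < rate]; outage capacity is the largest rate meeting a target outage."""
    g = rng.exponential(1.0, n)
    return float(np.mean(np.log2(1.0 + avg_snr * g) < rate))

print(ergodic_capacity(10.0))             # below log2(1+10) ~ 3.46 by Jensen's inequality
print(outage_probability(10.0, rate=2.0))
```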
- Rate distortion function
- Operational definition: the minimum achievable rate R for a (2^{nR}, n) code whose expected distortion is at most D as n goes to infinity.
- Information definition: min I(X; \hat{X}) s.t. expected distortion <= D, where the minimization is over the conditional distribution p(\hat{x}|x) (a closed-form example for a Bernoulli source follows this item).
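For a concrete instance of the information definition: a Bernoulli(p) source under Hamming distortion has the closed form R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p) and R(D) = 0 otherwise. A minimal sketch (function names are mine):

```python
import math

def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p: float, D: float) -> float:
    """R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p), else 0 (Hamming distortion)."""
    if D >= min(p, 1.0 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

print(rate_distortion_bernoulli(0.5, 0.1))   # 1 - H(0.1) ~ 0.531 bits per symbol
```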
- Computational information theory:
- Blahut's iterative algorithm (also known as the Blahut-Arimoto algorithm) is used to calculate 1) the channel capacity and 2) the rate-distortion function (a sketch of the capacity version appears at the end of this block).
- The key idea of Blahut's algorithm is similar to that of the Lloyd algorithm:
- Obtain the KKT conditions for min I(X; \hat{X}) s.t. expected distortion <= D.
- Decompose the KKT conditions into two optimality conditions.
- Use the two optimality conditions to design the two computation steps of each iteration.
- Prove that the iterative algorithm converges to the optimal solution.
- Blahut's iterative algorithm is a general
optimization algorithm, suitable for nonlinear programming, integer
programming, and mixed integer programming, i.e.,
- If the optimization problem is a high-dimensional nonlinear programming problem, Blahut's algorithm is a 1-D iterative nonlinear programming algorithm. Note that we still need a 1-D nonlinear programming algorithm to solve each subproblem.
- If the optimization problem is a high-dimensional integer programming problem, Blahut's algorithm is a 1-D iterative integer programming algorithm. Note that we still need a 1-D integer programming algorithm to solve each subproblem.
- If the optimization problem is a high-dimensional mixed integer programming problem, Blahut's algorithm is a 1-D iterative mixed integer programming algorithm. Note that we still need a 1-D mixed integer programming algorithm to solve each subproblem.
- Proof of convergence of the two Blahut
algorithms
- Blahut's iterative algorithm is based on the principle of divide and conquer, i.e., divide the problem into smaller subproblems, solve them, and merge the solutions. The following algorithms use the same principle:
- Gauss-Seidel version of value iteration in dynamic programming:
- For an N-dimensional cost vector J with N initial states, in each iteration, calculate the new cost FJ for only one initial state (instead of computing the cost vector J for all the initial states in parallel).
- Gaussian elimination method for solving a system of N linear equations:
- In each of the N iterations, solve for one unknown variable.
- Gibbs sampler for generating N-dimensional random vectors (see the sketch below):
- Each iteration consists of N steps; in each step, generate a sample from the 1-D full conditional distribution of one component given all the others.
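A minimal Gibbs-sampler sketch for a zero-mean bivariate Gaussian with correlation rho; each step draws one coordinate from its 1-D full conditional (the target distribution is my illustrative choice):

```python
import numpy as np

def gibbs_bivariate_gaussian(rho: float, n_samples: int = 20_000, seed: int = 0) -> np.ndarray:
    """Gibbs sampler: x1 | x2 ~ N(rho*x2, 1-rho^2) and x2 | x1 ~ N(rho*x1, 1-rho^2)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    out = np.empty((n_samples, 2))
    s = np.sqrt(1.0 - rho * rho)            # conditional standard deviation
    for i in range(n_samples):
        x[0] = rng.normal(rho * x[1], s)    # sample x1 from its full conditional
        x[1] = rng.normal(rho * x[0], s)    # sample x2 from its full conditional
        out[i] = x
    return out

samples = gibbs_bivariate_gaussian(rho=0.8)
print(np.corrcoef(samples.T))               # empirical correlation close to 0.8
```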
- Lloyd algorithm (see the sketch below):
- Each iteration consists of two steps, i.e., 1) the nearest-neighbor condition and 2) the centroid condition.
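A 1-D Lloyd-algorithm sketch on training samples, alternating the two conditions above (the initialization and the Gaussian test data are my choices):

```python
import numpy as np

def lloyd_quantizer(samples: np.ndarray, n_levels: int, n_iter: int = 50) -> np.ndarray:
    """Alternate 1) nearest-neighbor assignment and 2) centroid update of the codebook."""
    codebook = np.quantile(samples, np.linspace(0.1, 0.9, n_levels))   # initial codewords
    for _ in range(n_iter):
        cells = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)  # step 1
        for j in range(n_levels):                                                # step 2
            if np.any(cells == j):
                codebook[j] = samples[cells == j].mean()
    return np.sort(codebook)

rng = np.random.default_rng(1)
print(lloyd_quantizer(rng.normal(size=50_000), n_levels=4))
# For a unit Gaussian, the 4-level Lloyd-Max levels are roughly [-1.51, -0.45, 0.45, 1.51].
```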
- EM algorithm:
- Each iteration consists of two steps, i.e., the E-step and the M-step (a Gaussian-mixture sketch follows the next item).
- Generalized EM algorithm:
- Each iteration consists of two steps, i.e., a generalized E-step (taking the expectation of the lower bound of the Q function) and a generalized M-step (maximizing the resulting expectation of the lower bound); see http://mplab.ucsd.edu/wordpress/tutorials/EM.pdf
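A compact EM sketch for a two-component 1-D Gaussian mixture, showing the E-step/M-step alternation (the model and initialization are my illustrative choices):

```python
import numpy as np

def em_gmm_1d(x: np.ndarray, n_iter: int = 100):
    """EM for a 2-component 1-D Gaussian mixture: the E-step computes posterior
    responsibilities, the M-step re-estimates weights, means, and variances."""
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])        # crude but workable initialization
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: r[i, k] = P(component k | x_i)
        lik = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2.0, 1.0, 3000), rng.normal(3.0, 0.5, 7000)])
print(em_gmm_1d(data))   # roughly w=[0.3, 0.7], mu=[-2, 3], var=[1, 0.25]
```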
- Belief propagation: the sum-product algorithm is a fast algorithm to compute the marginals p(X_i|Y_1^N).
- BCJR for turbo iterative decoding
- Viterbi algorithm
- FFT algorithm
- Simulation-based optimization
- Reinforcement learning
- Nonlinear system theory
- Automaton theory
- Limitations of Blahut's algorithm
- Its performance relies on a 1-D iterative nonlinear/integer/mixed-integer programming algorithm. Existing nonlinear programming algorithms converge to a locally optimal solution rather than a globally optimal one (except for convex/concave objective functions); hence Blahut's algorithm cannot guarantee a globally optimal solution for general nonlinear programming problems. For integer programming problems, Blahut's algorithm may incur exponential complexity.
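A minimal sketch of the Blahut-Arimoto iteration for the capacity of a discrete memoryless channel (the rate-distortion version follows the same alternating pattern); the variable names, the stopping rule based on the standard capacity bounds, and the BSC sanity check are my choices:

```python
import numpy as np

def blahut_arimoto_capacity(W: np.ndarray, tol: float = 1e-9, max_iter: int = 10_000):
    """Blahut-Arimoto iteration for the capacity of a discrete memoryless channel.
    W[x, y] = p(y|x) is the transition matrix; returns (capacity in bits, input PMF)."""
    n_x = W.shape[0]
    p = np.full(n_x, 1.0 / n_x)                    # start from the uniform input PMF
    lower = 0.0
    for _ in range(max_iter):
        q = p @ W                                  # output PMF q(y) = sum_x p(x) W[x, y]
        # D[x] = sum_y W[x, y] * log(W[x, y] / q(y)), with the convention 0*log 0 = 0
        ratio = np.where(W > 0, W / np.maximum(q, 1e-300), 1.0)
        D = np.sum(np.where(W > 0, W * np.log(ratio), 0.0), axis=1)
        c = np.exp(D)
        lower = np.log(p @ c)                      # lower bound on C (in nats)
        upper = np.log(np.max(c))                  # upper bound on C (in nats)
        p = p * c / (p @ c)                        # multiplicative update of the input PMF
        if upper - lower < tol:
            break
    return lower / np.log(2.0), p

# Sanity check on a BSC with crossover probability 0.11: capacity ~ 0.5 bit, uniform input.
W_bsc = np.array([[0.89, 0.11],
                  [0.11, 0.89]])
print(blahut_arimoto_capacity(W_bsc))
```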
- Inequalities:
- Jensen Inequality (see the numerical check below):
- For a convex function f(x), f(E[x]) <= E[f(x)].
- For a concave function f(x), f(E[x]) >= E[f(x)].
- Many inequalities involving expectations (e.g., Hölder's inequality, the Cauchy-Schwarz inequality) can be proved using Jensen's inequality.
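A quick numerical check of Jensen's inequality with an exponential sample of mean 2, a convex f(x) = x^2, and a concave f(x) = log(x) (the distribution is my illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=1_000_000)

# Convex f(x) = x^2:  f(E[x]) <= E[f(x)]   (here: ~4 <= ~8)
print(np.mean(x) ** 2, np.mean(x ** 2))

# Concave f(x) = log(x):  f(E[x]) >= E[f(x)]   (here: ~0.69 >= ~0.12)
print(np.log(np.mean(x)), np.mean(np.log(x)))
```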
- Hölder inequality:
- Cauchy-Schwarz inequality:
- Minkowski inequality:
- Lyapunov inequality:
- Kraft inequality (see the check below):
- Intuition: the sum of probabilities of disjoint sets in a sample space should not exceed 1.
\sum_{x \in X} D^{-l(x)} <= 1, where x is a letter, X is the alphabet, D is the size of the code alphabet (D = 2 for binary codes), and l(x) is the length of the codeword assigned to x.
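A tiny check of the Kraft sum for binary (D = 2) codeword lengths; the example codes are my own:

```python
def kraft_sum(lengths, D=2):
    """Kraft sum over codeword lengths; it is <= 1 for any uniquely decodable D-ary code."""
    return sum(D ** (-l) for l in lengths)

# The prefix code {0, 10, 110, 111} has lengths 1, 2, 3, 3:
print(kraft_sum([1, 2, 3, 3]))   # 0.5 + 0.25 + 0.125 + 0.125 = 1.0  (satisfies <= 1)
# Lengths 1, 1, 2 cannot belong to any uniquely decodable binary code:
print(kraft_sum([1, 1, 2]))      # 1.25 > 1
```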
- Fano inequality:
- exp(x)>1+x, for x>0. (Use Taylor
expansion to prove.)
- exp(-x)>1-x, for x<0.
- ln(1+x)<x, for x>0. (Let f(x)=ln(1+x)-x.
Take the derivative and get the proof.) Draw a figure to see
this. f(x)=x increases faster than g(x)=ln(1+x).
- ln(1+x)<x, for -1<x<0. (Note the
sign changes when taking the derivative.)
- ln(1-x)<-x, for 0<x<1.
- ln(1+K)<= \sum_{k=1}^K 1/k <=
1+ln(K).
- MDL theory for universal coding: focuses on the mathematical properties (stochastic complexity, parametric complexity, optimal performance bounds, etc.) rather than on actual encoding algorithms.
- Entropy has three properties:
- Non-negativity
- Monotonic increase with the number of equally likely elementary events
- Additivity
- The essence of the additivity of entropy is that partitioning increases entropy. Partitioning is like branching in a tree.
Key Techniques:
- Asymptotic Equipartition Property (AEP): it stems from ergodic theory and ergodic theorems such as the Weak Law of Large Numbers. Use an epsilon-N argument and the concept of typical sequences. AEP is a useful tool for proving theorems in information theory (a numerical illustration for an i.i.d. source follows this sub-list).
- Strong AEP
- Weak AEP
- AEP for iid processes
- AEP for ergodic processes (Shannon-McMillan-Breiman
Theorem)
- The Strong Law of Large Numbers is not directly applicable (as it is for i.i.d. processes), since the terms of the stochastic sequence p(X_i|X_0^{i-1}) are not i.i.d. Use the sandwich method (upper/lower bounds) to prove it.
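A numerical illustration of the AEP for an i.i.d. Bernoulli(0.2) source: the normalized log-probability of an observed sequence concentrates around the entropy as n grows (the parameters are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.2                                            # i.i.d. Bernoulli(p) source
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)     # entropy ~ 0.722 bits

for n in (10, 100, 10_000):
    x = rng.random(n) < p                          # one source sequence of length n
    log_p = np.sum(np.where(x, np.log2(p), np.log2(1 - p)))
    print(n, -log_p / n, H)                        # -1/n * log p(X^n) -> H(X)
```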
- Random coding argument: generate a codebook randomly. Due to the symmetry of the randomly generated codewords, there is no need to compute the error probability for each codeword; considering only one codeword is sufficient to obtain the (average) error probability.
- Capacity for channels with additive white Gaussian noise:
- c = log(1+SNR), which is a concave function of SNR. For low SNR, c is approximately equal to SNR (with log in nats); the approximation error grows as SNR grows, and c < SNR for any SNR > 0 (see the check below).
- Instantaneous channel capacity: c(t) = log(1+SNR(t)), where SNR(t) is the instantaneous SNR at time t, for a flat-fading channel.
- Instantaneous channel capacity density: c(t,f) = log(1+SNR(t,f)), where SNR(t,f) is the SNR at frequency f and time t, for a frequency-selective fading channel.
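A quick check of the low-SNR behavior above, with capacity measured in nats so that c = ln(1+SNR):

```python
import math

# c = ln(1 + SNR) in nats: c ~ SNR at low SNR, the gap grows with SNR, and c < SNR for SNR > 0.
for snr in (0.01, 0.1, 1.0, 10.0):
    c = math.log(1.0 + snr)
    print(f"SNR={snr:6.2f}  c={c:7.4f} nats  gap={snr - c:7.4f}")
```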
Information theory vs. queueing theory
- The problem setting for information theory is that 1) there is no queue in the system, and 2) the communication channel is noisy (error-prone). In contrast, the problem setting for queueing theory is that 1) there is a queue in the system, and 2) the communication channel is noise-free (error-free).
- The problem with which information theory is concerned is the ultimate bounds for source coding and channel coding. In contrast, the problem that queueing theory addresses is the queueing delay distribution or the buffer overflow probability (or loss probability).
- A constant arrival rate and a constant departure rate make queueing theory less useful, e.g., in constant-bit-rate wireless voice communication.
- sharing
- point-to-point communication; topology