Information Theory Summary

Pepijn van Wijk

November 2025

Prerequisites

In Information Theory, the logarithm function will be used very regularly. We predominantly use the base \(2\) logarithm. Thus, whenever \(\log(x)\) is used, it is implied that this is the base \(2\) logarithm, i.e., \(\log x = \frac{\ln (x)}{\ln (2)}\). Some important logarithm identities and rules to remember:

Logarithm identities and rules

\[\begin{equation} \log(1) = 0 \end{equation}\]

\[\begin{equation} \log(x) = \begin{cases} > 0 & \text{if } x > 1, \\ < 0 & \text{if } 0 < x < 1, \\ 0 & \text{if } x = 1. \end{cases} \end{equation}\]

\[\begin{equation} \log(2) = 1 \end{equation}\]

\[\begin{equation} \log(2^x) = x \end{equation}\]

\[\begin{equation} 2^{\log(x)} = x \end{equation}\]

\[\begin{equation} \log(x \cdot y) = \log(x) + \log(y) \end{equation}\]

\[\begin{equation} \log(\frac{x}{y}) = \log(x) - \log(y) \end{equation}\]

\[\begin{equation} \log(x^y) = y \cdot \log(x) \end{equation}\]

\[\begin{equation} x^{\log(y)} = y^{\log(x)} \end{equation}\]

Probability Theory

Probabilities

A Probability space is a tuple \(( \Omega, \mathscr{F}, \mathbb{P} )\). Here, \(\Omega\) is the sample space (the set of possible outcomes \(\omega\)), \(\mathscr{F} \subseteq 2^{\Omega}\) is the set of events, and \(\mathbb{P}\) is the probability measure, assigning each outcome \(\omega \in \Omega\) a probability \(\mathbb{P}(\omega) \in [0,1]\) with \(\sum_{\omega \in \Omega} \mathbb{P}(\omega) = 1\).

For example, for a fair die we have \(\Omega = \{ 1, 2, 3, 4, 5, 6 \}\), \(\mathscr{F} = 2^{\Omega}\), and \(\mathbb{P}(\omega) = \frac{1}{6}\) for every \(\omega \in \Omega\).

An event is a subset \(\mathscr{A} \in \mathscr{F}\), with a probability \(\mathbb{P}[\mathscr{A} ] := \sum_{\omega \in \mathscr{A}} \mathbb{P}(\omega)\). For example, for the fair die from before, the ’event that the roll is even’ is \(\mathscr{A} = \{ 2, 4, 6 \}\), with \(\mathbb{P}[\mathscr{A}] = \frac{3}{6} = \frac{1}{2}\).

Conditional probability can be interpreted as the question: ’Given event \(\mathscr{A}\) occurred, how likely is event \(\mathscr{B}\)?’. Formally: given events \(\mathscr{A}, \mathscr{B} \in \mathscr{F}\) with \(\mathbb{P}[\mathscr{A}] > 0\), the probability of \(\mathscr{B}\) given \(\mathscr{A}\) is: \[\mathbb{P}[\mathscr{B} | \mathscr{A} ] := \frac{\mathbb{P}[\mathscr{A}, \mathscr{B}] }{ \mathbb{P}[\mathscr{A}] } = \frac{\mathbb{P}[\mathscr{A} \cap \mathscr{B}] }{ \mathbb{P}[\mathscr{A}] }\] For example: given that the roll is even (event \(\mathscr{A}\)), what is the probability that the roll is at most 2 (event \(\mathscr{B}\))? (and vice-versa)
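
Working this out for the die, with \(\mathscr{A} = \{2,4,6\}\), \(\mathscr{B} = \{1,2\}\) and hence \(\mathscr{A} \cap \mathscr{B} = \{2\}\):

\[\mathbb{P}[\mathscr{B} | \mathscr{A} ] = \frac{\mathbb{P}[\{2\}]}{\mathbb{P}[\{2,4,6\}]} = \frac{1/6}{1/2} = \frac{1}{3}, \;\;\;\;\;\; \mathbb{P}[\mathscr{A} | \mathscr{B} ] = \frac{\mathbb{P}[\{2\}]}{\mathbb{P}[\{1,2\}]} = \frac{1/6}{1/3} = \frac{1}{2}\]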

A Random Variable (RV) is a function that describes the outcome of an experiment involving randomness. In other words, given a probability space \(( \Omega, \mathscr{F}, \mathbb{P} )\), a RV \(X\) is a function \[X : \Omega \rightarrow \mathcal{X}\] for some set \(\mathcal{X}\). If \(\mathcal{X} \subseteq \mathbb{R}\), then \(X\) is called real-valued.

Given a RV \(X\) with image \(\mathcal{X}\), its probability distribution is \[\forall x \in \mathcal{X}, \;\;\;\; P_{X}(x) := \mathbb{P}[X=x] = \mathbb{P}[ \{ \omega \in \Omega : X(\omega) = x \} ]\] If the underlying RV is unimportant, any function \(P : \mathcal{X} \rightarrow [0,1]\) with \(\sum_x P(x) = 1\) is called a probability distribution.

We call a probability distribution uniform if \[\forall x \in \mathcal{X}, \;\;\;\; P_{X}(x) = \frac{1}{|\mathcal{X}|}\]

Notice that this implies that any two \(x, x' \in \mathcal{X}\) have the same probability, i.e. \(P_{X}(x) = P_{X}(x'), \;\;\; \forall x, x' \in \mathcal{X}\).

The support of \(P_X\) (or of \(X\)) is \[\text{supp}(X) = \text{supp}(P_X) := \{ x \in \mathcal{X} : P_{X} (x) > 0 \}\]

A Joint Probability Distribution \(P_{XY}\) describes a pair of RVs. Consider \(X\) and \(Y\) being two RVs on the same probability space with images \(\mathscr{X}\) and \(\mathscr{Y}\). The pair \((X, Y) = XY\) is an RV with probability distribution \[P_{XY} (x, y) := \mathbb{P}[ X=x, Y=y ]\] If \(P_{XY}(x, y) = P_X(x) \cdot P_Y(y)\) for all \(x \in \mathscr{X}, y \in \mathscr{Y}\), then \(X\) and \(Y\) are called independent.

You can go from a joint distribution to a marginal distribution of the RV via marginalization: \[P_X(x) = \sum_{y \in \mathscr{Y}} P_{XY}(x, y)\]

Given RV \(X\) and event \(\mathscr{A}\) with \(\mathbb{P} [\mathscr{A}] > 0\), the Conditional Probability Distribution is \[P_{X | \mathscr{A}} (x) := \frac{\mathbb{P} [X=x, \mathscr{A}] }{\mathbb{P} [\mathscr{A}] }\] Similarly, given another RV \(Y\) and \(y \in \text{supp}(Y)\), \[P_{X | Y}(x | y) := P_{X | Y=y} (x) = \frac{P_{XY}(x, y)}{P_{Y}(y)}\]

Probability statistics

The Expectation of a real-valued RV tells us what we can ’expect’ to happen. Formally: \[\mathbb{E}[X] := \sum_{x \in \text{supp}(X)} P_X (x) \cdot x\] For example, the expectation of a fair die is given by: \[\mathbb{E}[X] = \frac{1}{6} \cdot 1 + \frac{1}{6} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot 5 + \frac{1}{6} \cdot 6 = \frac{21}{6} = 3.5\] If \(\text{supp}(X) \subseteq \mathscr{X}\) and \(f : \mathscr{X} \rightarrow \mathbb{R}\): \[\mathbb{E}[f(X)] = \sum_{x \in \mathscr{X}} P_X (x) \cdot f(x)\] This is called the Law of the unconscious statistician. More generally: if \(f : \mathscr{X} \rightarrow \mathscr{Y}\) is a function, \(f(X)\) is a \(\mathscr{Y}\)-valued RV.

An outcome of RV \(X\) is not necessarily close to its expectation \(\mathbb{E}[X]\). This is only the case if its variance is small. The variance of a RV is given by: \[\text{var}[X] := \mathbb{E}[( X - \mathbb{E}[X])^2 ] = \mathbb{E}[X^{2}] - \mathbb{E}[X]^{2}\]
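
As a small illustration of these two formulas, here is a minimal Python sketch (the function names and the dictionary representation of a distribution are illustrative choices):

    # Minimal sketch: expectation and variance of a finite real-valued RV,
    # with the distribution given as {outcome: probability}.
    def expectation(dist):
        return sum(p * x for x, p in dist.items())

    def variance(dist):
        mu = expectation(dist)
        return sum(p * (x - mu) ** 2 for x, p in dist.items())

    fair_die = {x: 1/6 for x in range(1, 7)}
    print(expectation(fair_die))   # 3.5
    print(variance(fair_die))      # 35/12, approximately 2.917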

Concentration inequalities

Markov’s inequality tells us that, for any real-valued RV \(X\) and any \(t > 0\): \[\mathbb{P}[ |X| \geq t ] \leq \frac{\mathbb{E}[|X|] }{t}\]

Chebyshev’s inequality tells us that, for any \(t > 0\): \[\mathbb{P}[ |X - \mathbb{E}[X] | \geq t ] \leq \frac{\text{var}[X]}{t^2}\]

Let \(X_1, ..., X_n\) be i.i.d. binary random variables with \(\mathbb{E}[X_i] = \mu\), Hoeffding’s inequality tells us that: \[\mathbb{P}[ \sum_{i=1}^{n} X_i > (\mu + \delta)n ] \leq \text{exp}(-2\delta^{2}n), \;\;\;\;\;\;\;\; \forall \delta > 0\]
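
To get a feel for Hoeffding's inequality, the following Python sketch empirically estimates the probability of the bad event for a fair coin (\(\mu = 0.5\)) with \(\delta = 0.1\) and compares it to the bound; the parameters are arbitrary illustrative choices:

    import math
    import random

    # Empirically check Hoeffding's bound for i.i.d. Bernoulli(mu) variables:
    # estimate P[sum_i X_i > (mu + delta) * n] and compare it to exp(-2 delta^2 n).
    def bad_event_frequency(n, mu, delta, trials=20_000):
        count = 0
        for _ in range(trials):
            s = sum(1 for _ in range(n) if random.random() < mu)
            if s > (mu + delta) * n:
                count += 1
        return count / trials

    n, mu, delta = 100, 0.5, 0.1
    print("empirical frequency:", bad_event_frequency(n, mu, delta))
    print("Hoeffding bound:    ", math.exp(-2 * delta**2 * n))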

Important RVs

The Bernoulli Distribution is an RV that describes a biased coin flip: \[P_X(1) = p, P_X(0) = 1-p\] \[\mathbb{E}(X) = p \cdot 1 + (1-p) \cdot 0 = p\] \[\text{var}[X] = \mathbb{E}[ (X - \mathbb{E}[X])^2 ] = \mathbb{E}[(X - p)^2] = p(1-p)^2 + (1-p)(-p)^{2} = p(1-p)\]

The Binomial Distribution describes the sum of \(n\) independent RVs \(X_1 , ... , X_n \sim \text{Bernoulli}(p)\): \[X = \sum_{i=1}^{n} X_i \sim \text{Binomial}(n, p) , \;\;\; \forall k \in \{ 0, 1, ..., n \}, \;\;\; P_X(k) = {n \choose k}p^k (1-p)^{n-k}\] \[\mathbb{E}[X] = np\] \[\text{var}[X] = np(1-p)\]

Entropy

Jensen’s inequality

Convex and concave

A function is convex if, for any two points \(x_1, x_2\) in a domain \(\mathscr{D} \subseteq \mathbb{R}\), the straight line (chord) between \((x_1, f(x_1))\) and \((x_2, f(x_2))\) lies above the graph of the function. In other words, a function \(f : \mathscr{D} \rightarrow \mathbb{R}\) is convex if: \[\forall x_1, x_2 \in \mathscr{D} \;\; \text{and} \;\; \lambda \in [ 0, 1 ],\] \[\lambda f(x_1) + (1-\lambda)f(x_2) \geq f(\lambda x_1 + (1-\lambda)x_2 )\] Here, the left-hand side is the height of the chord, while the right-hand side is the value of \(f\) at the corresponding point between \(x_1\) and \(x_2\).

If \(f\) is convex, then \(-f\) is concave. If a function is concave, the above equation holds with \(\leq\) instead of \(\geq\).

A function \(f\) is strictly convex/ strictly concave if the above holds with equality only if

  1. \(\lambda \in \{ 0 , 1 \}\) OR

  2. \(x_1 = x_2\)

Naturally, \(f\) is strictly concave if \(-f\) is strictly convex.

Convexity or concavity can be proven as follows. Suppose \(f : \mathscr{D} \rightarrow \mathbb{R}\), \(\mathscr{D}\) is open, and \(f\) is twice differentiable on \(\mathscr{D}\). Then: \[f''(x) \geq 0 \;\; \forall x \in \mathscr{D} \implies f \;\;\text{is convex}\] with \(f''(x) \leq 0\) for all \(x \in \mathscr{D}\) implying concavity.

Jensen

Jensen’s inequality relates the average of the function values \(f(x_1), ..., f(x_n)\) to the function value of the average. Let \(f : \mathscr{D} \rightarrow \mathbb{R}\) be convex, \(\mathscr{D}\) an interval and \(n \in \mathbb{N}\). Let \(p_1, ..., p_n \geq 0\) such that \(\sum_{i=1}^{n} p_i = 1\). Then, \(\forall x_1, ..., x_n \in \mathscr{D}\), \[\sum_{i=1}^{n} p_i f(x_{i}) \geq f(\sum_{i=1}^{n} p_i x_i )\] which is the same as: \[\mathbb{E}[f(X)] \geq f(\mathbb{E}[X] )\] If \(f\) is strictly convex and all \(p_i > 0\), equality holds iff \(x_1 = ... = x_n\). Again, the reverse (\(\leq\)) holds for concave functions \(f\).
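
As a quick numeric check, take the convex function \(f(x) = x^2\), weights \(p_1 = p_2 = \frac{1}{2}\) and points \(x_1 = 1\), \(x_2 = 3\):

\[\sum_{i} p_i f(x_i) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 9 = 5 \;\; \geq \;\; f\Big( \sum_{i} p_i x_i \Big) = f(2) = 4\]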

Entropy

Entropy can be thought of as the information of a source. ’Given event \(\mathscr{A}\), how informative is the event?’. Essentially entropy is the expected surprisal value (measured in bits).

The surprisal value of an event, also known as the Shannon Information Content of an event/ outcome is given for a single event \(\mathscr{A}\) by: \[h(\mathscr{A}) := \log \frac{1}{\mathbb{P}[\mathscr{A}]}\]

The (Shannon) Entropy then, is the expected surprisal value of a \(\mathscr{X}\)-valued RV. \[H(X) := \mathbb{E}[\log \frac{1}{P_X(X)}] = \mathbb{E}[ -\log P_X(X) ] = \sum_{x \in \mathscr{X} } P_X(x) \log \frac{1}{P_X(x)} = - \sum_{x \in \mathscr{X}} P_X(x) \log P_X(x)\]
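
A minimal Python sketch of this definition, with a distribution represented as a dictionary (an illustrative choice of representation, not fixed notation):

    import math

    # Shannon entropy (in bits) of a distribution given as {outcome: probability}.
    def entropy(dist):
        return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

    print(entropy({0: 0.5, 1: 0.5}))             # 1.0 (fair coin)
    print(entropy({0: 0.99, 1: 0.01}))           # ~0.0808 (heavily biased coin)
    print(entropy({x: 1/6 for x in range(6)}))   # log2(6), approximately 2.585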

Let \(X\) be a \(\mathscr{X}\)-valued RV. Then: \[H(X) \geq 0\] With equality iff \(X\) is constant.

Let \(X\) be a RV. Then: \[H(X) \leq \log( | \text{supp}(X) | )\] With equality iff \(X\) is uniform. This is proven using Jensen’s inequality and the fact that the logarithm is strictly concave.

If \(X\) follows a Bernoulli Distribution (\(X \sim \text{Bernoulli}(p)\)): \[H(X) = h(p) := p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}\] where \(h\) is the binary entropy function.

If \(X\) follows a uniform distribution over \(\mathscr{X}\): \[H(X) = \log(| \mathscr{X} |)\]

The entropy of RV \(X\) given some event \(\mathscr{A}\) with \(\mathbb{P}[\mathscr{A}] > 0\) is: \[H(X | \mathscr{A}) := \sum_{x} P_{X | \mathscr{A}}(x) \log \frac{1}{P_{X | \mathscr{A}} (x)} \geq 0\]

Let \(X, Y\) be RVs with images \(\mathscr{X}, \mathscr{Y}\). The Conditional Entropy of \(X\) given \(Y\) is: \[H(X | Y) := \sum_{y \in \mathscr{Y}} P_Y(y) \cdot H(X | Y = y)\]

\[= \sum_{y} P_Y(y) \sum_x P_{X|Y}(x|y) \log \frac{1}{P_{X|Y} (x|y) }\] \[= \sum_{x,y} P_{XY}(x, y) \log \frac{P_Y(y)}{P_{XY} (x, y) }\] Notice that always \[H(X|Y) \geq 0\] With equality iff \(X\) is determined by \(Y\).

Given RVs \(X, Y\): \[H(X|Y) \leq H(X)\] and \[H(Y|X) \leq H(Y)\] With equality iff \(X\) and \(Y\) are independent. This again can be proven by using Jensen’s inequality.

The Chain Rule states that: \[H(XY) = H(X) + H(Y | X)\] as well as: \[H(XY) = H(Y) + H(X | Y)\] From this, the subadditivity of entropy follows: \[H(XY) \leq H(X) + H(Y)\] With equality iff \(X, Y\) are independent. This is proven by using the chain rule and \(H(Y|X) \leq H(Y)\)

The Chain Rule can be generalized: \[H(XY | Z) = H(X|Z) + H(Y | XZ)\] Going further: \[H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2 | X_1) + ... + H(X_n | X_1,...,X_{n-1}) = \sum_{i=1}^{n} H(X_i | X_1, ..., X_{i-1})\]

Mutual Information tells us how connected two RVs \(X, Y\) are, or how much seeing one outcome tells us about the other. \[I(X;Y) = H(X) - H(X|Y)\] Using the chain rule, this can be rewritten as \[I(X;Y) = H(X) + H(Y) - H(XY) = H(Y) - H(Y|X)\] which implies that \[I(X; Y) = I(Y ; X)\] Also, \[I(X;Y) \geq 0\] with equality iff \(X, Y\) are independent. And: \[I(X;X) = H(X)\] Also, \[I(X;Y) \leq H(Y)\] Lastly, another nice identity of mutual information: \[H(XY) = H(X|Y) + I(X ; Y) + H(Y | X)\]
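
The following Python sketch computes \(H(X|Y)\) and \(I(X;Y)\) for a small, arbitrarily chosen joint distribution, using the chain rule and the identity \(I(X;Y) = H(X) + H(Y) - H(XY)\) from above:

    import math

    def H(dist):
        # Shannon entropy (bits) of a distribution {outcome: probability}.
        return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

    # Example joint distribution P_XY given as {(x, y): probability}.
    P_XY = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}

    # Marginal distributions via marginalization.
    P_X, P_Y = {}, {}
    for (x, y), p in P_XY.items():
        P_X[x] = P_X.get(x, 0.0) + p
        P_Y[y] = P_Y.get(y, 0.0) + p

    H_X, H_Y, H_XY = H(P_X), H(P_Y), H(P_XY)
    print("H(X|Y) =", H_XY - H_Y)         # chain rule: H(XY) = H(Y) + H(X|Y)
    print("I(X;Y) =", H_X + H_Y - H_XY)   # mutual information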

The relations above can be visualized in an entropy diagram: a Venn-style picture in which the two circles represent \(H(X)\) and \(H(Y)\), their overlap is \(I(X;Y)\), the non-overlapping parts are \(H(X|Y)\) and \(H(Y|X)\), and the total area is \(H(XY)\).

Relative Entropy, also known as KL-Divergence is the expected excess surprise when modeling a source to take on distribution \(Q\), but it actually follows distribution \(P\): \[D(P||Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \sum_{x} P(x) (\log\frac{1}{Q(x)} - \log \frac{1}{P(x)} )\] If \(P = Q\), then \(D(P||Q) = 0 = D(Q || P)\)

If \(\exists x \in \text{supp}(P)\) with \(Q(x) = 0\), then \(D(P||Q) = \infty\)

\(D(P||Q) \neq D(Q||P)\)

\(D(P||Q) \geq 0\), with equality iff \(P=Q\)

It is connected to mutual information as follows: \[I(X;Y) = D(P_{XY}|| P_X \cdot P_Y )\]

Cross-Entropy is the expected surprise when anticipating samples from \(Q\), but receiving samples from \(P\). \[H_C(P, Q) = \sum_{x} P(x) \log \frac{1}{Q(x)}\] With \(H_C(P_X, P_X) = H(X)\)

If \(\exists x \in \text{supp}(P)\) with \(Q(x) = 0\), then \(H_C(P, Q) = \infty\)
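
A small Python sketch of both quantities (the example distributions are arbitrary); it also illustrates the asymmetry of \(D(P||Q)\) and the fact that \(H_C(P, Q) = H(P) + D(P||Q)\):

    import math

    # Relative entropy D(P||Q) and cross-entropy H_C(P,Q) for distributions given
    # as {outcome: probability}; both are infinite if Q(x) = 0 for some x in supp(P).
    def kl_divergence(P, Q):
        total = 0.0
        for x, p in P.items():
            if p == 0:
                continue
            q = Q.get(x, 0.0)
            if q == 0.0:
                return math.inf
            total += p * math.log2(p / q)
        return total

    def cross_entropy(P, Q):
        total = 0.0
        for x, p in P.items():
            if p == 0:
                continue
            q = Q.get(x, 0.0)
            if q == 0.0:
                return math.inf
            total += p * math.log2(1 / q)
        return total

    P = {0: 0.5, 1: 0.5}
    Q = {0: 0.9, 1: 0.1}
    print(kl_divergence(P, Q), kl_divergence(Q, P))   # not equal: D is asymmetric
    print(cross_entropy(P, Q))                        # equals H(P) + D(P||Q)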

Source Coding & Data Compression

Compression

Suppose we have an information source modeled by a \(\mathscr{X}\)-valued RV \(X\) with probability distribution \(P_{X}\). The goal of compression is to compress an \(x \sim P_X\) such that we can later decompress it to recover \(x\). An upper bound on the length of this compression (in bits) is given by \(\log(| \mathscr{X} |)\). Note that if \(X \sim \text{Uniform}(\mathscr{X})\), we cannot do better than this upper bound. If \(X\) is biased however, we can do better on average.

Suppose you want to send many outcomes of a RV. Let’s say this RV is the sum of two fair dice. The string you send consists of binary codes, with each potential outcome \(x\) being associated with a different binary string. Naturally, to send the string of outcomes efficiently, you want its length to be as short as possible. To do this, you want to make sure the outcomes with the highest likelihood of occurring are associated with the shortest binary codes.

In this section, the goal is to find how short codes can be on average, for a source \(X\).

Symbol Codes

Symbol codes encode a source \(P_X\) symbol-by-symbol. Let \(P_X\) be an \(\mathscr{X}\)-valued source. Then, a symbol code for the source \(P_X\) with code alphabet \(\mathscr{A}\) is an injective function \[C : \mathscr{X} \rightarrow \mathscr{A}^{\star}\] Sometimes \(C\) is called an encoding function. Above, the Kleene star is defined by: \[\mathscr{A}^{\star} = \bigcup_{n \in \mathbb{N}} \mathscr{A}^{n} \cup \{ \bot \}, \; \; \; \text{where } \bot \text{ is the empty string}\]

Call \[\mathscr{C} = \text{im}(C) = \{ C(x) : x \in \mathscr{X} \}\] a code, and elements of \(\mathscr{C}\) codewords. We speak of a Binary symbol code if we have \(\mathscr{A} = \{ 0, 1 \}\). Naturally, we are particularly interested in symbol codes whose average codeword length is as small as possible.

Let \(C\) be an encoding function. Then we define an extended code \(C^{\star} : \mathscr{X}^{\star} \rightarrow \mathscr{A}^{\star}\) by concatenation \[C^{\star}(x_1, x_2,...,x_n) := C(x_1) | C(x_2) | ... | C(x_n)\] Given that \(C\) is injective, we can always recover \(x\) from \(C(x)\). However, given a \(C^{\star}(x_1,...,x_n)\), we can not always recover the original \(x_1, ..., x_n\). That is to say, \(C^{\star}\) might not be injective. One solution is to introduce a special ’separation symbol’. The problem with this however, is that such a symbol might A) not be available, or B) increase the compressed string’s length.

A symbol code \(C : \mathscr{X} \rightarrow \mathscr{A}^{\star}\) is called uniquely decodable if \(C^{\star}\) is injective. This unique decodability can be guaranteed by ensuring a binary symbol code is prefix-free, also called instantaneous. This is the case if \[\forall x \neq x' \in \mathscr{X}, C(x) \; \text{is \textbf{not} a prefix of } \; C(x')\] Now, with this prefix-free encoding, we can recover \(x_1,...,x_n\) from \(C(x_1)|...|C(x_n)\) by reading the encoding from left-to-right. Thus:

If \(C\) is prefix-free and \(\text{im}(C) \neq \{ \bot \}\), then \(C\) is uniquely decodable

Note however that the reverse does not hold. For example, consider a suffix-free symbol code. One nice property of uniquely decodable, prefix-free symbol codes is that you can decode the string on-the-fly/ in a stream. That is, you do not need all the bits of \(C^{\star}(x_1, ..., x_n)\) to start decoding.

Codeword Length

The performance of a code can be defined by the average length of a code \(C\) for a source \(P_X\): \[\ell_{C}(P_X) := \mathbb{E}[\ell(C(X))] = \sum_{x \in \mathscr{X} } P_X(x) \cdot \ell(C(x))\] With this, we define the minimal code length of a given class of codes \(\mathbb{C}\) for a source \(P_X\) as follows: \[\ell^{\mathbb{C}}_{\text{min}}(P_X) := \min_{C \in \mathbb{C}} \ell_{C}(P_{X})\] For example, for prefix-free codes this would be: \(\ell^{\text{p.f}}_{\text{min}}(P_X)\).

Kraft’s inequality gives us necessary and sufficient conditions for the existence of prefix-free binary codes. It states as follows:

Let \(\ell_1,...,\ell_m \in \mathbb{N}\). There exists a prefix-free binary code \(\mathscr{C} = \{ c_1 , ..., c_m \} \subseteq \{ 0, 1 \}^{\star}\) such that \(\ell(c_i) = \ell_i \;\; \forall i = 1, ..., m\), if and only if \(\sum_{i=1}^{m} 2^{-\ell_{i}} \leq 1\).

This turns out to also be true for uniquely decodable codes. McMillan’s inequality states:

For a uniquely decodable binary code \(\mathscr{C}\): \(\sum_{c \in \mathscr{C}} 2^{-\ell(c)} \leq 1\)
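
As a tiny sketch, checking Kraft's inequality for two hypothetical lists of codeword lengths in Python:

    # Kraft sum of a list of proposed binary codeword lengths: a prefix-free
    # (equivalently, uniquely decodable) binary code with these lengths exists
    # if and only if the sum is at most 1.
    def kraft_sum(lengths):
        return sum(2 ** -l for l in lengths)

    print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> such a code exists, e.g. {0, 10, 110, 111}
    print(kraft_sum([1, 1, 2]))      # 1.25 -> no such code exists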

Thus, we now know that for a source \(P_X\), \(\ell^{\text{p.f.}}_{\text{min}}(P_X) = \ell^{\text{u.d.}}_{\text{min}}(P_X)\). We will now define \(\ell_{\text{min}}(P_X) := \ell^{\text{p.f.}}_{\text{min}}(P_X) = \ell^{\text{u.d.}}_{\text{min}}(P_X)\). If a code \(C\) is such that \(\ell_C(P_X) = \ell_{\text{min}}(P_X)\), it is called optimal for the source \(P_X\). Naturally, if \(P_X\) is more unpredictable, \(\ell_{\text{min}}(P_X)\) should be larger. Now, using Jensen’s inequality and Kraft’s inequality, it can be proven that \[H(X) \leq \ell_{\text{min}}(P_X) < H(X) + 1\] Note that this is for a single outcome. For \(n\) outcomes, we roughly need \(nH(X)\) bits, with a worst case length of \(nH(X) + n\).

Source-Coding Algorithms

Now that we know that there exist good prefix-free codes, the question remains: how can we find such codes? That is, given a source \(P_X\), can we find a prefix-free code \(C\) with \(\ell_C(P_X) = \ell_{\text{min}}(P_X) \approx H(X)\)? It turns out we can do this using Huffman coding.

In Huffman coding, we construct a prefix-free code as a binary tree. The idea is to put down a leaf for each symbol \(x\) with weight \(P_X(x)\), and to iteratively group the lightest nodes. We do this as follows:

  1. \(\forall x \in \mathscr{X}\), put down a node \(v_{x}\) corresponding to \(x\) with weight \(\text{wt}(v_{x}) = P_X(x)\)

  2. Repeat until the tree is connected:

    1. choose 2 lightest nodes \(u, v\)

    2. add a node \(w\) as the parent of \(u\) and \(v\), and assign it weight \(\text{wt}(w) = \text{wt}(u) + \text{wt}(v)\)

The created Huffman code \(C_{H}\), obtained by labeling the two edges below every internal node with \(0\) and \(1\) and reading the codeword for \(x\) off the path from the root to leaf \(v_{x}\), is optimal and prefix-free. That is, for any other uniquely decodable code \(C'\), \[\ell_{C_{H}}(P_X) \leq \ell_{C'}(P_X) \;\;\; \text{that is:} \;\;\; \ell_{C_{H}}(P_X) = \ell_{\text{min}}(P_X)\]
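
A minimal Python sketch of this construction, using a heap for the "choose the two lightest nodes" step (function names and the example distribution are illustrative choices):

    import heapq

    # Huffman coding sketch: repeatedly merge the two lightest nodes into a parent
    # node, then read codewords off the root-to-leaf paths of the resulting tree.
    def huffman_code(P_X):
        # Heap entries are (weight, tie-breaker, tree); a tree is either a source
        # symbol (leaf) or a pair (left_subtree, right_subtree).
        heap = [(p, i, x) for i, (x, p) in enumerate(P_X.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
            counter += 1
        code = {}
        def assign(tree, prefix):
            if isinstance(tree, tuple):            # internal node: label edges 0 / 1
                assign(tree[0], prefix + "0")
                assign(tree[1], prefix + "1")
            else:                                  # leaf: the prefix is the codeword
                code[tree] = prefix or "0"         # edge case: one-symbol alphabet
        assign(heap[0][2], "")
        return code

    P_X = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    C = huffman_code(P_X)
    print(C)
    print("average length:", sum(P_X[x] * len(C[x]) for x in P_X))   # 1.75 = H(X) here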

Using the Huffman algorithm to obtain a binary symbol code does not always yield compression close to the entropy. Consider a biased coin flip with probabilities \(0.99\) and \(0.01\). Using the Huffman code, you would need one bit to encode each outcome of such a RV. The entropy of this RV however, is only \(h(0.01) \approx 0.08\): much lower than what we achieve, yet we desire our code length to be as close to the entropy as possible. In fact, using a symbol code requires you to use \(\geq 1\) bit per source character, even if the entropy of the underlying RV is much lower than \(1\).

Arithmetic Codes

For many real-life use cases, a naive symbol code is too simplistic. When compressing text, for example, the distribution of characters changes depending on the preceding characters. In order to efficiently encode human language, we thus want a code which takes into account a potentially changing distribution, depending on previous outcomes. You could, for example, re-run the Huffman algorithm for each character; obviously, this is far from ideal. To solve this problem, we will end up using arithmetic codes. Arithmetic codes are based on the idea that you can view any transmission in a binary alphabet as defining an interval on the real number line.

First, we define the standard binary representation of a real number \(r \in [ 0,1 )\): a (possibly infinite) string of bits \[c_1,c_2,c_3,... \in \{ 0, 1 \} \; \text{such that} \; \sum_{i=1}^{\infty} c_i 2^{-i} = r\] By convention, we take the shortest such representation. Note that some reals in \([0,1)\) do not have a finite representation. However, any interval \([ a, b )\) with \(a < b\) contains a number \(r \in [ a, b )\) with a finite representation. Another note is that prepending a \(0\) on the left divides the represented number by \(2\).

Let \(P_X\) be a source with \(\mathscr{X} = \{ x_1, ..., x_m \}\). Divide the interval \([ 0, 1 )\) into sub-intervals \(I_{x_{j}} = [ a_{j-1}, a_{j} )\) for \(j = 1, ..., m\), where \(a_{j} = \sum_{i=1}^{j} P_X(x_i)\). We can now define the arithmetic symbol code \(\text{AC} : \mathscr{X} \rightarrow \{ 0, 1 \}^{\star}\): \[\text{AC}(x_j) := \text{\textbf{shortest} possible binary representation of an } \; r \in I_{x_{j}}\] Using this code, it is true that \[H(X) <\ell_{\text{AC}} (P_X) < H(X) + 1\] Note that arithmetic codes are not necessarily prefix-free, or even uniquely decodable. They can however be made prefix-free at the cost of one extra bit (on average).
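
A minimal Python sketch of this arithmetic symbol code, under the convention that we pick the shortest non-empty binary representation in each sub-interval (helper names are illustrative):

    import math
    from fractions import Fraction

    # Arithmetic symbol code sketch: split [0,1) into sub-intervals I_x of length
    # P_X(x) and encode x as the shortest binary representation of some r in I_x.
    def shortest_binary_in(a, b):
        # Shortest non-empty bit string c_1...c_k with 0.c_1...c_k in [a, b).
        k = 1
        while True:
            m = math.ceil(a * 2**k)        # smallest multiple of 2^-k that is >= a
            if Fraction(m, 2**k) < b:
                return format(m, "b").zfill(k)
            k += 1

    def arithmetic_code(P_X):
        code, a = {}, Fraction(0)
        for x, p in P_X.items():
            b = a + p
            code[x] = shortest_binary_in(a, b)
            a = b
        return code

    P_X = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
    print(arithmetic_code(P_X))   # {'a': '0', 'b': '1', 'c': '11'} -- note: not prefix-free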

Although the average length of an arithmetic code is longer than that of a Huffman code, a big advantage is that it can adapt to changing distributions. For example, when encoding a stream of English text, the probability of the next letter is heavily influenced by the letters preceding it. Because of this advantage, arithmetic codes are used in video compression algorithms such as AVC and HEVC.

Typical Sets & Encryption

We call a sequence of observed i.i.d. samples \(\vec{x} = (x_1, ..., x_n)\) from a RV \(X\) typical if its average surprisal is close to \(H(X)\). Intuitively, if we want to encode our data in as few symbols as possible, typical strings should get short encodings, while non-typical strings should get longer encodings. The typical set containing the typical sequences, then, must not be too big (since there are only few short codewords), and must contain most of the probability mass.

The Law of large numbers states that the average of a large number of i.i.d. samples converges to the expected value. We can use this law to make the notion of typicality precise.

Asymptotic Equipartition Property

For a sequence \(x_1 , x_2, ...\) of real numbers, \(x_n \rightarrow^{n \rightarrow \infty} x\) means that: \[\forall \epsilon>0 , \exists n_{0} \in \mathbb{N} \;\; \text{s.t} \;\; \forall n \geq n_{0}, \;\; |x_{n} -x| \leq \epsilon\] We then say this sequence converges to \(x\).

Similarly, we can define the limit of random variables \(X_1, X_2, ...\) to random variable \(X\) as follows:

In probability: \(X_n \rightarrow^{p} X\) if: \[\forall \delta > 0 , \mathbb{P}[ | X_n - X | > \delta] \rightarrow^{n \rightarrow \infty} 0\] In words, "the probability that \(X_n\) is far from \(X\) goes to \(0\)."

In mean square \(X_n \rightarrow^{\text{{m.s.}}} X\) if: \[\mathbb{E}[ (X_n - X)^{2} ] \rightarrow^{n \rightarrow \infty} 0\] In words, "The expected square of the difference \(| X_n - X |\) goes to 0."

Almost surely: \(X_n \rightarrow^{\text{a.s.}} X\) if: \[\mathbb{P}[ \lim_{n \rightarrow \infty} X_n = X ] = 1\]

The Weak Law of Large Numbers tells us something about the convergence of random variables. Let \(X_1, X_2, ...\) be real i.i.d. RVs with finite mean (\(\mathbb{E}[X_i] = \mu \in \mathbb{R}\)). Then, defining RVs \(S_n = \frac{1}{n} \sum_{i=1}^{n} X_i\), we have \(S_n \rightarrow^{p} \mu\) (here \(\mu\) is a constant RV).

There is also an entropy variant of the weak law of large numbers, the Asymptotic Equipartition Property (AEP): Let \(X_1, X_2, ...\) be i.i.d. random variables with distribution \(P_X\) and finite entropy \(H(X)\). Then \[- \frac{1}{n} \log P_{X_1...X_n} (X_1, ..., X_n) \rightarrow^{p} H(X)\] where again \(H(X)\) is a constant RV. In words, the AEP states that the probability of a long i.i.d. sequence concentrates around the value \(2^{-nH(X)}\).

With this, we now have a way to formalize whether or not a sequence \((x_1, ..., x_n) \in \mathscr{X}^n\) coming from a source \(P_X\) is typical. Formally, a typical set can be defined as follows:

Let \(X_1, X_2, ...\) be i.i.d. RVs \(\sim P_X\). Then the typical set \(A_{\epsilon}^{(n)}\) with respect to \(P_X\) is the set of sequences \((x_1, x_2, ..., x_n) \in \mathscr{X}^{n}\) with the property: \[2^{-n(H(X)+\epsilon)} \leq P_{X_1...X_n}(x_1, x_2, ..., x_n) \leq 2^{-n(H(X)-\epsilon)}\] where, because the RVs are i.i.d., \(P_{X_1...X_n}(x_1, ..., x_n) = \prod_{i=1}^{n} P_X(x_i)\). Taking logarithms and dividing by \(n\), this is the same as:

\[H(X)-\epsilon \leq \frac{1}{n}\log \frac{1}{P_{X_1...X_n}(x_1, ..., x_n)} \leq H(X)+\epsilon\]

This means that for fixed \(\epsilon\), the typical set is composed of all sequences whose probabilities are close to \(2^{-nH(X)}\). Thus, in order for a sequence to belong to the typical set, it just has to have a probability close to \(2^{-nH(X)}\). The middle term in the definition can be thought of as the sample entropy of the sequence. In other words, the typical set is the set of sequences which give an amount of information close to the average information of the RV \(X\). This is also why the most probable sequence in general does not belong to the typical set: it gives less information than average, since the less probable an outcome, the more information it gives.

Also note that as \(n\) increases, more and more sequences become typical. Note though, that although the absolute size of the typical set gets larger, its relative size compared to the total number of possible outcomes shrinks (exponentially). The size of the typical set is roughly \(2^{nH(X)}\).
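
The following Python sketch enumerates the typical set of a Bernoulli source by brute force for a small block length (all parameters are arbitrary choices); it illustrates that the typical set is a small fraction of all \(2^{n}\) sequences, has size of the order \(2^{nH(X)}\), and carries much of the probability mass (the mass approaches \(1\) only as \(n\) grows):

    import itertools
    import math

    # Brute-force the typical set A_eps^(n) of an i.i.d. Bernoulli(p) source
    # (only feasible for small n) and report its size and total probability mass.
    def typical_set(p, n, eps):
        H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # entropy per symbol
        typical, mass = [], 0.0
        for seq in itertools.product([0, 1], repeat=n):
            k = sum(seq)
            prob = p**k * (1 - p) ** (n - k)
            sample_entropy = -math.log2(prob) / n
            if H - eps <= sample_entropy <= H + eps:
                typical.append(seq)
                mass += prob
        return typical, mass, H

    A, mass, H = typical_set(p=0.2, n=20, eps=0.2)
    print(len(A), "typical sequences out of", 2**20)
    print("probability mass:", mass)     # tends to 1 as n grows
    print("2^(nH) =", 2 ** (20 * H))     # |A| lies between 2^(n(H-eps)) and 2^(n(H+eps))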

The concept of typical sets can be used in source coding. We code the source in blocks of length \(n\) characters, giving us \(\vec{x}^{n}\). With this, we can create two types of codes:

Lossy codes. If \(\vec{x}^{n} \in A_{\epsilon}^{(n)}\), we assign it a binary label of length \(\lceil n(H(X) + \epsilon) \rceil\). If \(\vec{x}^{n} \notin A_{\epsilon}^{(n)}\), we assign it a dummy label. This type of code, unfortunately, loses information with probability \(\leq \epsilon\) (for \(n\) large enough).

Lossless codes. Typical strings again get binary labels of length \(\lceil n(H(X) + \epsilon) \rceil\); if \(\vec{x}^{n} \notin A_{\epsilon}^{(n)}\), we instead assign it a binary label of length \(\lceil n \log |\mathscr{X}| \rceil\). Here, we need to add a 1-bit prefix flag to indicate whether or not a string was typical. It can be shown that such a code can be more efficient than a Huffman or arithmetic code: its expected length is given by \(\mathbb{E}[ \ell(C(X^{n})) ] \leq n(H(X)+\epsilon)\). Note that for this to work we do need the flag to indicate the typical strings. For large enough blocks, this extra cost is negligible, but using large blocks is not always possible.

Encryption

We first define some terms:

Let \(X, Y, Z\) be random variables. The Conditional Mutual Information of \(X\) and \(Y\) given \(Z\) is then: \[I(X ; Y | Z) = \sum_{z \in \mathscr{Z}} P_Z(z) I(X;Y | Z=z)\] And \[I(X ; Y | Z) = H(X|Z) - H(X|YZ) = H(Y|Z) - H(Y|XZ)\] This conditional mutual information has some properties that are very similar to mutual information’s properties: \[I(X; Y | Z) = I(Y ; X | Z ) \;\;(\textbf{symmetry})\] \[I(X;Y | Z) \geq 0 \;\; (\textbf{non-negativity})\] \[I(X;Y | Z) = 0 \; \text{iff $X$ and $Y$ are \textbf{independent given }$Z$} \; (P_{XY|Z} (x,y|z) = P_{X|Z}(x|z) \cdot P_{Y|Z}(y|z) )\] We can also use this conditional mutual information in the chain rule: \[I(WX ; Y) = I(X;Y) + I(W ; Y | X)\] \[I(WX ; Y|Z) = I(X;Y|Z) + I(W ; Y | ZX)\] \[I(WX ; Y |Z) = H(WX |Z) - H(WX | YZ)\]

Perfectly Secure Encryption

An Encryption Scheme for the message \(M\) consists of a key \(K\) and a ciphertext \(C = \text{Enc}(M, K)\) for an encryption function \(\text{Enc}\) such that: \[\begin{align*} & 1. \; I(M ; K) = 0 \;\; \text{(a \textbf{setup assumption})} \\ & 2. \; H(M | KC) = 0 \;\;\text{(a \textbf{correctness assumption})} \\ & 3. \; I(M ; C) = 0 \end{align*}\] Also, vice-versa we can get \(M = \text{Dec}(K, C)\) for a decryption function \(\text{Dec}\).

Note that to satisfy the correctness assumption \(H(M | KC) = 0\), for each key value \(k\), the encoding function \(m \rightarrow \text{Enc}(m, k)\) must be injective.

A perfectly secure encryption scheme is one such that someone who intercepts the encrypted message \(C\), learns nothing about the underlying message \(M\). Formally, this is the case if: \[I(M ; C) = 0\] This is called information-theoretic security. In reality, much of modern cryptography is only computationally secure. This means that even if \(C\) gives some information about \(M\) to the eavesdropper, it is computationally infeasible to find the theoretically revealable part of \(M\).

One example of a perfectly secure encryption scheme is the one-time pad. Let \((G, +)\) be an abelian / additive group. \[\begin{align*} & 1. \text{Let } \mathscr{M} = G \text{ and let } M \; \text{be uniform over } \mathscr{M} \\ & 2. \text{Let } \mathscr{K} = \mathscr{M} = G \text{ and let } K \; \text{be uniform over } \mathscr{K} \\ & 3. \text{Let } \mathscr{C} = \mathscr{M} = G \text{ as well} \end{align*}\] Now, define for \(m \in \mathscr{M}\) and \(k \in \mathscr{K}\), \(\text{Enc}(m,k) = m + k = c\). Then, for \(c \in \mathscr{C}\) and \(k \in \mathscr{K}\), \(\text{Dec}(c, k) = c + (-k) = m\)

For example, say \((G, +) = (\{ 0, 1\}^{4} , XOR)\), as well as \(m = 0101\) and \(k = 1100\). Then we have: \[\text{Enc}(m, k) = \text{Enc}(0101, 1100) = 0101 \; XOR \; 1100 = 1001\] \[\text{Dec}(c, k) = \text{Dec}(1001, 1100) = 1001 \; XOR \; 1100 = 0101\] This one-time pad code is perfectly secure. One major drawback however is the fact that the key needs to be as long as the message. Also, reusing the key leaks information. Unfortunately, this limitation is inherent to any perfectly secure encryption scheme.
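
A minimal Python sketch of the one-time pad on byte strings, where the group operation is bitwise XOR (the example message is arbitrary):

    import secrets

    # One-time pad over (G, +) = ({0,1}^n, XOR): encryption and decryption are the
    # same XOR operation, and the key must be exactly as long as the message.
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    message = b"attack at dawn"
    key = secrets.token_bytes(len(message))   # uniformly random key of the same length
    ciphertext = xor(message, key)            # Enc(m, k) = m XOR k
    print(xor(ciphertext, key))               # Dec(c, k) = c XOR k recovers the message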

For any perfectly secure encryption scheme: \[H(K) \geq H(M)\]

Random Processes

So far, in encryption and source coding we have only dealt with i.i.d. random variables. In real life however, RVs are often heavily linked and dependent. For example, in English text the next character is heavily dependent on which characters precede it.

Markov Chains

A Markov Chain is a process that transitions between states, such that the probability of moving from state \(Y\) to a new state \(Z\) is solely dependent on \(Y\). Thus, a sequence of random variables forms a Markov Chain if the distribution of a RV only depends on the one that preceded it.

The RVs \(X, Y, Z\) form a Markov Chain \(X \rightarrow Y \rightarrow Z\) if \(P_{Z|XY}(z | x, y) = P_{Z|Y}(z | y)\) for all \(x \in \mathscr{X}, y \in \mathscr{Y}, z \in \mathscr{Z}\).

With this, the following three statements are equivalent: (1) \(X \rightarrow Y \rightarrow Z\) forms a Markov Chain; (2) \(Z \rightarrow Y \rightarrow X\) forms a Markov Chain; (3) \(X\) and \(Z\) are independent given \(Y\), i.e. \(P_{XZ|Y}(x, z|y) = P_{X|Y}(x|y) \cdot P_{Z|Y}(z|y)\).

Because of this equivalence, we often write \(X \leftrightarrow Y \leftrightarrow Z\).

The data-processing inequality says that \[\text{if } X \rightarrow Y \rightarrow Z \text{, then } I(X;Y) \geq I(X;Z)\] With equality holding iff: \[I( X ; Y | Z) = 0\] This inequality also means that for any function \(f\) on the range of \(Y\), \[I(X ; Y) \geq I(X ; f(Y))\]

Let \(P_{X}^{\theta}\) be a family of distributions parametrized by \(\theta\). Now, let \(T\) be any statistic (a function \(T : \mathscr{X} \rightarrow \mathbb{R}\)). Then this forms the following Markov Chain: \[\theta \rightarrow X \rightarrow T(X)\] Thus, according to the data-processing inequality: \[I(\theta ; X) \geq I(\theta ; T(X) ) \text{ with equality iff } I( \theta ; X | T(X) ) = 0\] That is, the equality holds iff \(\theta \leftrightarrow T(X) \leftrightarrow X\) is also a Markov Chain. In this case, \(T(X)\) tells us everything \(X\) can tell us about \(\theta\). We call \(T\) a sufficient statistic if \(I(\theta ; X | T(X)) = 0\). In other words, for each \(t\), \(X|T(X) = t\) is independent of \(\theta\).

Stochastic Processes

A Stochastic Process is a sequence of RVs \[\{ X_{i} : \Omega \rightarrow \mathscr{X} \} \text{ for i} \in \mathbb{N}\] The process is characterized by the collection of probability distributions \[\{ P_{X_n | X_1...X_{n-1}} : n \in \mathbb{N} \}\] Equivalently, the process is characterized by a collection of joint probability distributions \[\{ P_{X_{1}...X_{n}} : n \in \mathbb{N} \}\] Note that all RVs have the same domain (the probability space \(\Omega\)) and the same range \(\mathscr{X}\). Also, we can have \(|\Omega| = \infty\), if each \(X_n\) only depends on the finite part of \(\Omega\).

A stochastic process \(\{ X_i : \Omega \rightarrow \mathscr{X} \}\) is a Markov Process (or Markov chain) if \(\forall n \in \mathbb{N} , \text{ } X_1 \rightarrow X_2 \rightarrow ... \rightarrow X_n\). Thus, each step only depends on the previous step (it is memoryless). To describe the process, we just need to give \(\{ P_{X_{n} | X_{n-1}} : n \geq 2 \}\) and \(P_{X_1}\).

A Markov Process is time-invariant if \[\forall n \in \mathbb{N} \text{, } \forall a, b \in \mathscr{X} \text{ with } P_{X_{n}}(b) > 0 \text{ \& } P_{X_{1}}(b) > 0,\] \[P_{X_{n+1} | X_{n}}(a|b) = P_{X_{2} | X_{1}}(a|b)\] We can visualize time-invariant Markov Processes with state diagrams, in which nodes represent states and arrows are labeled with transition probabilities.

A Markov Process is finite-state if \(| \mathscr{X} | < \infty\).

An alternative representation for a finite-state, time-invariant Markov process is given by the transition matrix \(R\) (and the initial distribution). For \(\mathscr{X} = \{ 1, 2, ..., k \}\). Define, for \(i,j \in \mathscr{X}\), \[R_{ij} = P_{X_2 | X_1}(j | i)\] Where \(R_{ij}\) represents the probability of moving from state \(i\) to state \(j\). This then defines a transition matrix \(R \in \mathbb{R}^{k \times k}\).

A stationary distribution for a time-invariant Markov process is a distribution \(P_{X_{n}}\) such that \(P_{X_{n+1}} = P_{X_{n}}\). That is: \[\forall x \in \mathscr{X} \text{, } P_{X_{n+1}}(x) = \sum_{x'} P_{X_{n+1} | X_{n}} (x | x') P_{X_{n}}(x') = P_{X_{n}}(x)\] In vector notation: a vector \(\mu \in \mathbb{R}^k\) such that \(\mu R = \mu\). If \(P_{X_1}\) is a stationary distribution, then the entire process is stationary.

It can also be shown that every time-invariant finite-state Markov process has a stationary distribution. In general, in order to find the stationary distribution, you solve for \(\mu\) satisfying \(\mu R = \mu\) and \(\sum_{i=1}^{k} \mu_{i} = 1\).
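
A small Python sketch of this procedure, solving \(\mu R = \mu\) together with \(\sum_{i} \mu_{i} = 1\) as a linear system (the two-state transition matrix is an arbitrary example):

    import numpy as np

    # Find a stationary distribution mu (a row vector with mu R = mu and sum(mu) = 1)
    # of a time-invariant, finite-state Markov process with transition matrix R.
    def stationary_distribution(R):
        k = R.shape[0]
        # Stack the equations (R^T - I) mu^T = 0 with the normalization sum(mu) = 1.
        A = np.vstack([R.T - np.eye(k), np.ones(k)])
        b = np.concatenate([np.zeros(k), [1.0]])
        mu, *_ = np.linalg.lstsq(A, b, rcond=None)
        return mu

    R = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    mu = stationary_distribution(R)
    print(mu)        # approximately [2/3, 1/3]
    print(mu @ R)    # equals mu again (stationarity)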

Note that, even though every finite-state, time-invariant Markov process has a stationary distribution, this stationary distribution is not necessarily unique, and the process does not necessarily converge to it.

We need two additional properties to guarantee uniqueness and convergence:

Irreducibility. A time-invariant Markov process is irreducible if every state is reachable from any other state in a finite number of steps.

Aperiodicity. A state in a time-invariant Markov Process is aperiodic if the greatest common divisor of all path lengths from that state to itself is \(1\). The process is aperiodic if all states are aperiodic.

Now, we can state (proof omitted), that if a time-invariant Markov Process is finite-state, irreducible and aperiodic, then it has a unique stationary distribution. Moreover, from any initial distribution, the distribution \(P_{X_{n}}\) converges to the stationary distribution as \(n \rightarrow \infty\).

Entropy Rate

The entropy rate of a stochastic process \(\{ X_i \}\) is \[H(\{ X_i \}) := \lim_{n \rightarrow \infty} \frac{1}{n} H(X_1, ..., X_n)\] whenever this limit exists. For a stationary process the limit exists, and it also equals \(\lim_{n \rightarrow \infty} H(X_n | X_1, ..., X_{n-1})\).

Error Correction & Zero Error Transmission

So far, we have focused on compressing information. First, we looked at symbol-by-symbol codes for i.i.d. sources. Next, we looked at i.i.d. sources but encoded them in blocks. Finally, we looked at general stochastic processes. In these sections, we have assumed that there is no noise. In reality, channels suffer from some amount of unavoidable noise, leading to unpredictable errors corrupting our data stream. If our data is compressed, a single bit-flip can lead to the loss of the original data. Thus, we need a method of adding redundancy to our data to protect it from errors.

Noisy Channels

A Noisy Channel is a discrete, memoryless channel, given by a tuple \((\mathscr{X}, P_{Y|X}, \mathscr{Y})\), where \(\mathscr{X}\) is the input alphabet, \(\mathscr{Y}\) is the output alphabet, and \(P_{Y|X}\) is a conditional probability distribution describing the channel.

Thus, a channel defines a conditional distribution, giving the probability of reading output \(y \in \mathscr{Y}\), given input \(x \in \mathscr{X}\).

The fact that the channel is memoryless means that the output depends only on the current input. That is, if we use the channel \(n\) times, you can regard it as a channel \(( \mathscr{X}^{n} , P_{Y^{n} | X^{n}}, \mathscr{Y}^{n} )\) where \[P_{Y^{n} | X^{n}}(\vec{y} | \vec{x}) = \prod_{i=1}^{n} P_{Y|X}(y_i | x_i)\]

A channel is deterministic if the output is determined by the input: \[\forall P_{X}, \;\; H(Y|X) = 0\] i.e. \[\forall x \in \mathscr{X} , \;\; \exists y \in \mathscr{Y} \text{ such that } P_{Y|X}(y | x) = 1\]

A channel is lossless if the input is determined by the output: \[\forall P_{X}, \;\; H(X | Y) = 0\] i.e. \[\forall y \in \mathscr{Y} , \;\; \exists \text{ unique } x \in \mathscr{X} \text{ such that } P_{Y|X}(y|x) > 0\]

A channel is noiseless if it is both deterministic and lossless.

Error-Correcting Codes

Let \(M, n \in \mathbb{N}\). Then an \((M,n)\)-code for the channel \(( \mathscr{X}, P_{Y|X} , \mathscr{Y} )\) consists of a message set \([M] = \{ 1, ..., M \}\), an encoding function \(\text{enc} : [M] \rightarrow \mathscr{X}^{n}\), and a decoding function \(\text{dec} : \mathscr{Y}^{n} \rightarrow [M]\).

We call the set of all codewords \(\{ \text{enc}(m) : m \in [M] \}\) the codebook (which have length \(n\)), or the code. Another notation is \([n,k]\)-code. This means that you encode a \(k\)-bit message into \(n\)-bit codewords.

The rate \(R\) of a \((M,n)\)-code is \[R = \frac{\log M}{n} = \frac{k}{n} \text{ for a } [n,k]\text{-code}\] This can be interpreted as the ratio of useful bits to total bits transmitted per channel use.

Given a \((M, n)\)-code for the channel \(( \mathscr{X}, P_{Y|X} , \mathscr{Y} )\), the probability of error \(\lambda_{m}\) for a message \(m \in [M]\) is: \[\lambda_{m} := \mathbb{P}[ \text{dec}(Y^{n}) \neq m | X^{n} = \text{enc}(m) ]\] That is, it is the probability that the decoded output does not equal input \(m\). \[\text{\textbf{maximal} probability of error: } \lambda^{(n)} := \max_{m \in [M]} \lambda_{m}\] \[\text{\textbf{Average} probability of error: } p_{e}^{(n)} := \frac{1}{M} \sum_{m=1}^{M} \lambda_{m}\]

The number of bit flips a code can correct for depends on how spread out its codewords are. Formally, given a binary code with codebook \(C\), the minimal distance of that code is \[d_{\min} := \min \{ d(x,y) : x,y \in C, x \neq y \}\] Where \(d(x, y)\) is the Hamming Distance between \(n\)-bit strings, indicating how many bits you need to flip to turn \(x\) into \(y\). \[d(x, y) = \sum_{i=1}^{n} | x_i - y_i | = | \{ i \in [n] : x_i \neq y_i \} | = | x \oplus y |\] Now, we can also denote an \([n,k]\)-code as an \([n,k,d]\)-code, where \(d = d_{\min}\). A code with \(d_{\min} = d\) can detect up to \(d - 1\) bit flips, and correct up to \(\lfloor \frac{d-1}{2} \rfloor\) bit flips.

Linear Codes

A code \(C \subseteq \{ 0, 1 \}^{n}\) is linear if \[c, c' \in C \implies c \oplus c' \in C\] This means that \[C \text{ is linear} \iff C \text{ is a subspace}\] We’ll always talk about binary linear codes, meaning that addition/ multiplication are done modulo 2.

One nice property of linear codes is that they can be encoded conveniently. Given an \([n, k]\)-linear code \(C\), a generator matrix \(G\) is a \(k \times n\) matrix such that the rows form a basis for \(C\). By row-reduction, \(G\) can be transformed into systematic form \[G = [ I_k | P ]\] where \(I_k\) is the \(k \times k\) identity matrix and \(P\) is a \(k \times (n-k)\) matrix representing parity bits.

To encode a message \(m \in \{ 0, 1 \}^{k}\), we view it as a row vector and compute: \[\text{enc}(m) = mG\] The codebook is \[C = \{ mG : m \in \{ 0, 1 \}^{k} \}\]

To decode, a parity-check matrix \(H\) helps: \[H := [P^{T} | I_{n-k} ]\] Then, given a received word \(y \in \{ 0, 1 \}^{n}\), the syndrome is \(yH^{T}\). The syndrome tells you something about what went wrong: an uncorrupted codeword \(c \in C\) has syndrome \(\vec{0}\), and if a single bit \(i\) was flipped, the syndrome equals the transpose of the \(i\)-th column of \(H\). Thus, from the syndrome you can make a strong guess regarding which bit was flipped, if any.
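
As an illustration, here is a Python sketch of syndrome decoding for the \([7,4,3]\) Hamming code, with one standard choice of the parity part \(P\) (this particular code is an example, not taken from the notes above):

    import numpy as np

    # Systematic [7,4,3] Hamming code: G = [I_4 | P], H = [P^T | I_3].
    P = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [0, 1, 1],
                  [1, 1, 1]])
    G = np.hstack([np.eye(4, dtype=int), P])    # generator matrix (4 x 7)
    H = np.hstack([P.T, np.eye(3, dtype=int)])  # parity-check matrix (3 x 7)

    def encode(m):
        return (m @ G) % 2

    def decode(y):
        y = y.copy()
        syndrome = (y @ H.T) % 2
        if syndrome.any():                      # non-zero syndrome: locate the flipped bit
            for i in range(7):
                if np.array_equal(syndrome, H[:, i]):
                    y[i] ^= 1                   # flip it back
                    break
        return y[:4]                            # systematic form: message = first 4 bits

    m = np.array([1, 0, 1, 1])
    c = encode(m)
    c[5] ^= 1                                   # the channel flips a single bit
    print(decode(c))                            # recovers [1 0 1 1]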

Zero-Error Channel Capacity

Sometimes, it is possible to transmit a message through a noisy channel perfectly, i.e. with \[\text{maximal probability of error } \lambda^{(n)} = 0\]

Two inputs \(x, x' \in \mathscr{X}\) are confusable if there is an output \(y \in \mathscr{Y}\) with \(P_{Y|X}(y|x) > 0\) and \(P_{Y|X}(y|x') > 0\). The confusability graph \(G\) of a channel has vertex set \(\mathscr{X}\) and an edge between every pair of confusable inputs. Given a channel with confusability graph \(G\), the maximal message set \([M]\) that can be communicated perfectly in a single channel use is of size \(\alpha(G)\), where \(\alpha\) is the independence number of graph \(G\). Thus, the set \(\{ \text{enc}(m) : m \in [M] \} \subseteq \mathscr{X}\) forms an independent set in \(G\), meaning \(M \leq \alpha(G)\).

Now, let \(S = \{ x_1, ..., x_{\alpha(G)} \} \subseteq V(G) = \mathscr{X}\) be an independent set of maximal size. Define

  1. \(\text{enc}(m) = x_{m} , \;\: m \in [\alpha(G)]\)

  2. \(\text{dec}(y) = \text{unique } m \in [\alpha(G)] \text{ such that } P_{Y|X}(y | x_{m}) > 0\)

This code has zero error!

Sometimes you can communicate at a higher rate when using the channel more often. The Shannon capacity of a graph determines how many bits can be communicated on average per channel use, and is given by: \[c(G) := \sup_{n \in \mathbb{N} } \frac{\log \alpha(G^{\boxplus n})}{n}\] where \(G^{\boxplus n} = G \boxplus ... \boxplus G\) is the strong product of the graph with itself, \(n\) times. Sometimes, it is defined as \[c(G) := \sup_{n \in \mathbb{N} } \alpha(G^{\boxplus n})^{1/n}\] which counts the number of messages per channel use. Also, note that \[c(G) = \lim_{n \rightarrow \infty} \frac{\log \alpha(G^{\boxplus n})}{n}\] Unfortunately, \(c(G)\) is very hard to compute. In fact, it is not even known whether computing it is decidable.

Now, we will take a look at communicating over noisy channels when allowing a small error probability. The channel capacity of a discrete, memoryless channel \((\mathscr{X} , P_{Y|X} , \mathscr{Y})\) is \[C := \max_{P_{X}} I(X ; Y)\] Using the channel multiple times does not increase the capacity per channel use.

Often computing the capacity \(C\) will be easiest if you define mutual information from the viewpoint of the output, i.e. \(I(X;Y) = H(Y) - H(Y|X)\).
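
A small Python sketch that computes the capacity of the binary symmetric channel with flip probability \(f\) from this viewpoint, maximizing \(H(Y) - H(Y|X)\) over the input distribution; the result should match the known closed form \(C = 1 - h(f)\):

    import math

    def h(p):
        # Binary entropy function.
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # Capacity of the binary symmetric channel: C = max over P_X of H(Y) - H(Y|X),
    # where H(Y|X) = h(f) regardless of the input distribution.
    def bsc_capacity(f, steps=10_000):
        best = 0.0
        for i in range(steps + 1):
            q = i / steps                        # P_X(1) = q
            p_y1 = q * (1 - f) + (1 - q) * f     # P_Y(1)
            best = max(best, h(p_y1) - h(f))
        return best

    f = 0.1
    print(bsc_capacity(f), 1 - h(f))             # both approximately 0.531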

Intuitively, it is harder to send information with zero-error than with some small amount of error. Thus, let \((\mathscr{X}, P_{Y | X} , \mathscr{Y})\) be a discrete, memoryless channel of capacity \(C\), and let \(G\) be its confusability graph. Then \[c(G) \leq C\]

Information can be communicated through a DMC with an arbitrarily small probability of error at any rate less than the channel capacity \(C\).

Noisy Channel Coding

As discussed in the previous section, the channel capacity represents the maximum amount of information that could in principle be sent over a noisy channel per channel use. We don’t know yet whether this capacity is actually achievable, however.

Given a channel, a rate \(R\) is achievable if there exists a sequence of \((\lceil 2^{Rn} \rceil, n )\)-codes for \(n \in \mathbb{N}\) such that \[\lim_{n \rightarrow \infty} \lambda^{(n)} = 0\] Where \(\lambda^{(n)}\) is the max error probability. Also, a \((\lceil 2^{Rn} \rceil, n )\)-code has a rate \(\geq R = \frac{\log(2^{Rn})}{n}\).

Shannon’s noisy coding theorem states the following:

For a discrete memoryless channel with capacity \(C\), any rate \(R < C\) is achievable.

That is, \(\forall \epsilon > 0\), \(\exists n_{0} \in \mathbb{N}\) such that for all \(n \geq n_{0}\), there exists a \((2^{Rn}, n)\)-code with maximal error \(\lambda^{(n)} \leq \epsilon\). This is proven by showing a randomly constructed code has low average error probability (with high probability), which implies there exists a specific code with low maximal error probability.

Suppose we see the output \(Y\) of a noisy channel. Let \(\hat{X} = g(Y)\) be our guess for the input \(X\). Note that \(X \rightarrow Y \rightarrow \hat{X}\) forms a Markov chain. Also note that if \(H(X|Y)\) is large, then it should be harder to guess \(X\). Fano’s inequality quantifies this intuition.

Let \(X\) and \(Y\) be RVs and let \(\hat{X} = g(Y)\) for some function \(g\). Define \(p_e := \mathbb{P}[ X \neq \hat{X} ]\), the probability of error. Then \[H(X | Y) \leq p_e \cdot \log(| \mathscr{X} | -1 ) + h(p_e)\] Since \(h(p_e) \leq 1\): \[p_e \geq \frac{H(X|Y) - 1}{\log(|\mathscr{X}| -1 )}\]

On a discrete, memoryless channel with capacity \(C\), any code with rate \(R > C\) has an average probability of error: \[p_e^{(n)} \geq 1 - \frac{C}{R} - \frac{1}{nR}\] Note that for large enough \(n\), this is always \(>0\), i.e. \(\lim_{n \rightarrow \infty} p_e^{(n)} \neq 0\).

With these theorems in place, the question remains ’How should we transmit a source’s output over a noisy channel?’

Let \(V_1 , ..., V_n \sim^{iid} P_V\). Let \((\mathscr{X}, P_{Y|X} , \mathscr{Y} )\) be a channel with capacity \(C\). If \(H(V) < C\), then it turns out there exists a code with \[\mathbb{P}[\hat{V}^{n} \neq V^n ] \rightarrow^{n \rightarrow \infty} 0\] The analogous result holds for any finite-state stochastic process satisfying AEP with entropy rate \(H(\{ V_i\}) < C\).

Similarly, let \(V_1 , ..., V_n \sim^{iid} P_V\) and let \((\mathscr{X}, P_{Y|X} , \mathscr{Y} )\) be a channel with capacity \(C\). If \(H(V) > C\), then any source-channel code has error probability bounded away from \(0\): \[\lim_{n \rightarrow \infty} p_e^{(n)} \neq 0\]