Information Theory and Entropy – Shannon’s Mathematical Theory of Communications – A Guide to Mathematical Studies (Archive)

In 1948, Claude E. Shannon, a mathematician and electrical engineer, revolutionized the fields of telecommunications and information processing with his groundbreaking work, “A Mathematical Theory of Communication”¹. Shannon’s theories provided a formal foundation for the science of information, introducing concepts that are now integral to digital communication, data compression, and even cryptography. Central to his theory are the ideas of information entropy and the quantification of information, which have profound implications across multiple disciplines. In this installment, we will look more deeply into what Shannon provided in terms of how information can be measured and transmitted, so without further ado, let’s begin!

So, what exactly is information theory?

Information theory, as defined by Britannica, is ‘the mathematical representation of the conditions and parameters affecting the transmission and processing of information’, or, more simply put, the study of the quantification, storage, and communication of information. Shannon’s theory addresses several fundamental problems:

How can we measure information?
What is the maximum rate at which information can be transmitted over a communication channel?
How can we ensure the accuracy of information transmission in the presence of noise?

Shannon’s Measure of Information

To proceed any further into solving these problems, we first need to understand what information is. It is often represented as a chain of bits, each bit being a logical state with one of two possible values, in this case 0 or 1. Shannon built on this by proposing that information could be measured in terms of uncertainty, or entropy. The more uncertain or unpredictable an event is, the more information it conveys once it is known.

Shannon defined the entropy $H$ of a discrete random variable $X$ with possible values $x_1, x_2, \ldots, x_n$ and corresponding probabilities $p(x_1), p(x_2), \ldots, p(x_n)$ as:

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$

By prioritizing words which have a higher probability of occurring (in this case giving them a shorter code word), we can reduce the average number of bits that need to be transferred, and with it the opportunity for data loss.

In this context, entropy represents the average amount of information produced by a stochastic source of data. Higher entropy indicates a higher level of uncertainty and, therefore, more information content.
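The idea of giving likelier symbols shorter code words can be sketched with Huffman coding, one standard way of building such a code. The snippet below is a minimal illustration rather than anything taken from Shannon's paper; the four-symbol source and its probabilities are hypothetical and chosen so the arithmetic is easy to follow.

```python
import heapq
import math


def huffman_code(probs):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    # Heap entries: (probability, unique tie-breaker, {symbol: partial code word}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least likely groups, prefixing their code words with 0 and 1.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]


# Hypothetical source, used only for illustration.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)

print(code)
print("entropy (bits/symbol):", entropy)
print("average code length (bits/symbol):", avg_len)
```

For this particular source the average code length equals the entropy, because every probability is a power of 1/2. In general the average length of a Huffman code lies between $H(X)$ and $H(X) + 1$ bits per symbol.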

The Concept of Entropy

Entropy, in Shannon’s sense, is a measure of unpredictability or information content. It is analogous to the concept of entropy in thermodynamics, where it represents the degree of disorder or randomness in a system. In information theory, entropy quantifies the expected value of the information contained in a message.

Consider a simple example: flipping a fair coin. There are two possible outcomes, heads or tails, each with a probability of 0.5. The entropy for this system is calculated as:

$H = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$

This means that one bit of information is produced with each coin flip. If the coin were biased, the entropy would be lower because the outcome would be more predictable. On the other hand, adding more equally likely outcomes increases entropy. This is loosely reminiscent of the second law of thermodynamics, which states that the entropy of an isolated system never decreases over time.
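The coin example can be checked numerically. The sketch below implements the entropy formula directly; the function name `entropy_bits` and the example distributions (a 90/10 coin and a fair six-sided die) are illustrative choices, not part of the original text.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy H(X) in bits for a discrete probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 are skipped, following the convention 0 * log2(0) = 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair = entropy_bits([0.5, 0.5])      # fair coin
biased = entropy_bits([0.9, 0.1])    # hypothetical biased coin
die = entropy_bits([1 / 6] * 6)      # fair six-sided die

print("fair coin:  ", fair)
print("biased coin:", biased)
print("fair die:   ", die)
```

The biased coin comes out below one bit and the die above it, matching the intuition that predictability lowers entropy while more equally likely outcomes raise it.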

Shannon’s Channel Capacity

A significant aspect of Shannon’s theory is determining the maximum rate at which information can be transmitted over a communication channel, known as the channel capacity. Shannon’s Channel Capacity Theorem states:

$C = B \log_2 (1 + \frac{S}{N})$

where $C$ is the channel capacity in bits per second, $B$ is the bandwidth of the channel in hertz, and $\frac{S}{N}$ is the signal-to-noise ratio (SNR) expressed as a linear power ratio. In terrestrial and commercial communications systems the SNR is typically much greater than 1, so that $1 + \frac{S}{N} \approx \frac{S}{N}$. This allows us to rewrite the equation as follows, where we express the SNR in decibels as $\mathrm{SNR_{dB}} = 10\log_{10}(\frac{S}{N})$ and use $10\log_{10}(2) \approx 3$:

$C \approx B\frac{\log_{10}(\frac{S}{N})}{\log_{10}(2)} = B\frac{10\log_{10}(\frac{S}{N})}{10\log_{10}(2)} \approx B\left(\frac{\mathrm{SNR_{dB}}}{3}\right)$
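To get a feel for how close the high-SNR approximation is, the following sketch compares it with the exact formula. The 1 MHz bandwidth and 30 dB SNR are hypothetical values picked for illustration, and the function names are my own.

```python
import math


def capacity_exact(bandwidth_hz, snr_linear):
    """Capacity in bits per second: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1 + snr_linear)


def capacity_high_snr(bandwidth_hz, snr_db):
    """High-SNR approximation: C is roughly B * SNR_dB / 3."""
    return bandwidth_hz * snr_db / 3


bandwidth_hz = 1e6                  # hypothetical 1 MHz channel
snr_db = 30                         # hypothetical SNR in decibels
snr_linear = 10 ** (snr_db / 10)    # convert decibels to a linear power ratio

exact = capacity_exact(bandwidth_hz, snr_linear)
approx = capacity_high_snr(bandwidth_hz, snr_db)

print("exact capacity (bit/s): ", exact)
print("approximation (bit/s):  ", approx)
print("relative difference:    ", abs(exact - approx) / exact)
```

At 30 dB the two values differ by well under one percent. The gap widens as the SNR approaches 1, where dropping the "+1" is no longer justified.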

This theorem implies that for a given bandwidth and signal-to-noise ratio, there is a theoretical limit to the amount of information that can be transmitted reliably. Thus, what engineers can do is try to approach this limit by mitigating the effects of noise through error-correcting codes.

One well-known family of such codes is the Hamming codes, published by Richard W. Hamming in 1950 at Bell Labs. They use parity bits to check whether the message has changed during transmission; each parity bit indicates whether the number of ones in a group of data bits is even or odd. If an odd number of bits in that group is altered during transmission, the parity will change, allowing the error to be detected.
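The behaviour of a single parity bit can be sketched in a few lines. This is a minimal illustration assuming even parity over one group of bits; the helper names and the example word are hypothetical.

```python
def add_even_parity(bits):
    """Append one parity bit so that the total number of ones is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word still contains an even number of ones."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])   # hypothetical 4-bit group

one_flip = word.copy()
one_flip[2] ^= 1                       # a single bit altered in transit

two_flips = one_flip.copy()
two_flips[0] ^= 1                      # a second bit altered as well

print(parity_ok(word))       # unmodified word passes the check
print(parity_ok(one_flip))   # one flip is detected
print(parity_ok(two_flips))  # two flips cancel out and go unnoticed
```

A lone parity bit detects an odd number of flips but cannot say which bit changed, and it misses an even number of flips. Hamming codes address the first limitation by combining several overlapping parity checks.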

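We’ll consider a simple Hamming code with 4 data bits $d$ and 3 parity bits $p$, often referred to as a (7,4) Hamming code. This can be represented by two matrices, the generator matrix $G$ and the parity-check matrix $H$.

The generator matrix for a (7,4) Hamming code, written in systematic form with the codeword ordered as $(d_1\ d_2\ d_3\ d_4\ p_1\ p_2\ p_3)$, is:

$G = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{pmatrix}$

This matrix is used to encode the 4-bit message into a 7-bit codeword. The first 4 columns form the identity matrix, which reproduces the original message, while the remaining three columns produce the parity-check bits, which in this case represent the following equations (all additions are modulo 2).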

$p_1 = d_1 + d_2 + d_4$
$p_2 = d_1 + d_3 + d_4$
$p_3 = d_2 + d_3 + d_4$

A fourth equation, namely $p = d_1 + d_2 + d_3$, could also be used. It is not required here, however, since the three equations above are linearly independent and together are enough to uniquely identify any single bit flip.

The parity-check matrix for the same code has the property $\mathbf{c}\cdot{H^T}=\mathbf{0}$ (i.e. the zero vector) for every valid codeword $\mathbf{c}$, and in this case, consistent with the parity equations above, is:

$H = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}$

This matrix is used to check for errors in the received codeword by returning a syndrome vector $\mathbf{s}$. The error pattern (i.e. the bit flip) is the difference, modulo 2, between the transmitted codeword and the erroneously received word. By finding the error pattern through the syndrome vector, which we will demonstrate below, we are able to recover the original message.

Encoding Process

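Here, let’s encode a 4-bit message, for example, $\mathbf{m} = (0\ 0\ 1\ 0)$.

To encode the message, we multiply it by the generator matrix:

$\mathbf{c} = \mathbf{m} \cdot G$

Calculating the product modulo 2, and writing $\mathbf{g}_i$ for the $i$-th row of $G$, only the third message bit is non-zero, so the product is simply the third row:

$\mathbf{c} = 0\cdot\mathbf{g}_1 + 0\cdot\mathbf{g}_2 + 1\cdot\mathbf{g}_3 + 0\cdot\mathbf{g}_4 = \mathbf{g}_3 = (0\ 0\ 1\ 0\ 0\ 1\ 1)$

Thus, the encoded codeword is $(0\ 0\ 1\ 0\ 0\ 1\ 1)$.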
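The encoding step can be reproduced in code. In the sketch below the generator matrix is written out in systematic form from the three parity equations given earlier; that layout, with the codeword ordered as $(d_1\ d_2\ d_3\ d_4\ p_1\ p_2\ p_3)$, is an assumption on my part, chosen because it reproduces the codeword stated above.

```python
# Generator matrix [I | P], assumed from the parity equations:
# p1 = d1 + d2 + d4, p2 = d1 + d3 + d4, p3 = d2 + d3 + d4 (mod 2).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def encode(message, generator=G):
    """Multiply a 4-bit message (row vector) by the generator matrix, modulo 2."""
    n = len(generator[0])
    return [
        sum(m * row[j] for m, row in zip(message, generator)) % 2
        for j in range(n)
    ]


codeword = encode([0, 0, 1, 0])
print(codeword)  # [0, 0, 1, 0, 0, 1, 1]
```

Because the code is systematic, the first four bits of every codeword are the message itself, and only the last three need to be computed.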

Decoding and Error Correction

Suppose the received codeword $\mathbf{r}$ is $(0\ 0\ 1\ 1\ 0\ 1\ 1)$, which has an error in the 4th bit. To check for errors, we multiply the received codeword by the transpose of the parity-check matrix:

$\mathbf{s} = \mathbf{r} \cdot H^T$

Calculating the product modulo 2, each syndrome bit re-evaluates one parity equation on the received data bits and adds the corresponding received parity bit:

$s_1 = r_1 + r_2 + r_4 + r_5 = 0 + 0 + 1 + 0 = 1$
$s_2 = r_1 + r_3 + r_4 + r_6 = 0 + 1 + 1 + 1 = 1$
$s_3 = r_2 + r_3 + r_4 + r_7 = 0 + 1 + 1 + 1 = 1$

The syndrome $\mathbf{s}$ indicates the position of the error. In this case, the syndrome $\mathbf{s} = (1, 1, 1)$ maps to the 4th bit, indicating an error there: $d_4$ is the only bit that appears in all three parity equations. Correcting this bit, we get the original codeword $(0\ 0\ 1\ 0\ 0\ 1\ 1)$, which, after discarding the parity bits, returns the original message $(0\ 0\ 1\ 0)$. The mapping from each syndrome to the bit responsible for the discrepancy can be pre-calculated; this is left as an exercise for the readers.
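For readers who would like to check their answer to the exercise, here is one possible sketch of syndrome decoding with a pre-calculated lookup table. The parity-check matrix is assumed to have the systematic form implied by the parity equations, and the names `syndrome`, `correct` and `SYNDROME_TO_POSITION` are my own. It relies on the fact that a single flipped bit at position $j$ produces a syndrome equal to column $j$ of $H$.

```python
# Parity-check matrix [P^T | I], assumed from the parity equations in the text.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def syndrome(received, check=H):
    """Compute s = r * H^T modulo 2 as a tuple of three bits."""
    return tuple(sum(r * h for r, h in zip(received, row)) % 2 for row in check)


# Pre-calculated table: each column of H maps to the (0-based) position it flags.
SYNDROME_TO_POSITION = {tuple(row[j] for row in H): j for j in range(7)}


def correct(received):
    """Fix at most one flipped bit; return the corrected word and the syndrome."""
    s = syndrome(received)
    fixed = list(received)
    if s != (0, 0, 0):
        fixed[SYNDROME_TO_POSITION[s]] ^= 1
    return fixed, s


received = [0, 0, 1, 1, 0, 1, 1]
corrected, s = correct(received)
message = corrected[:4]   # discard the three parity bits

print(s)          # (1, 1, 1)
print(corrected)  # [0, 0, 1, 0, 0, 1, 1]
print(message)    # [0, 0, 1, 0]
```

The table has seven entries, one per bit position, and all seven columns of $H$ are distinct and non-zero, which is what makes every single-bit error uniquely identifiable. Two or more flipped bits would be miscorrected by this scheme.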

Applications and Implications

Shannon’s information theory has vast applications in modern technology and science:

Data Compression: Entropy is a fundamental concept in data compression algorithms, such as Huffman coding and the Lempel-Ziv-Welch (LZW) algorithm, which aim to reduce the size of data without losing information.
Cryptography: Information theory provides tools for analyzing the security of cryptographic systems, helping to ensure that encryption methods are robust against various types of attacks.
Telecommunications: Shannon’s work underpins the development of efficient coding and modulation schemes, enabling reliable data transmission over noisy channels.
Machine Learning: Concepts of entropy and information gain are integral to decision trees and other machine learning algorithms, influencing how models learn from data.

Claude Shannon’s mathematical theory of communications laid the foundation for the digital age, transforming our understanding of information and its transmission. By introducing the concepts of entropy and channel capacity, Shannon provided the tools necessary to quantify and optimize communication systems. His work continues to influence a wide array of fields, from computer science and engineering to biology and economics, demonstrating the enduring power and versatility of information theory.

p.s. : As I will be on vacation for the next month, there will not be an article for August – instead, enjoy a collection of articles and essays I have written over the past year!

¹ Link to original text: https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
