Homework 8

Shannon Entropy and Diversity Measures

Shannon entropy and other diversity measures quantify the uncertainty, randomness, or diversity present in probability distributions. These measures provide critical tools for analyzing distributions across various fields, from information theory to ecology and economics.

1. Measures of Distributions

Shannon Entropy

Shannon entropy, introduced by Claude Shannon in 1948, is the foundational measure of uncertainty or randomness in a probability distribution.
For a discrete probability distribution $P = \{p_1, p_2, \ldots, p_n\}$ where $\sum_{i=1}^n p_i = 1$, Shannon entropy is defined as:

\[H(P) = -\sum_{i=1}^n p_i \log p_i\]

The logarithm base determines the units: base 2 gives bits, base $e$ gives nats, and base 10 gives dits. Shannon entropy reaches its maximum value of $\log n$ when all probabilities are equal ($p_i = 1/n$), representing maximum uncertainty. Conversely, when one outcome is certain ($p_i = 1$ for some $i$), entropy is zero, as there is no uncertainty.
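
As a quick sanity check on these properties, here is a minimal Python sketch of the discrete formula (the helper name `shannon_entropy` is our own):

```python
import math

def shannon_entropy(p, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Convention: terms with p_i = 0 contribute nothing (0 log 0 = 0).
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(shannon_entropy([0.25] * 4))      # 2.0 bits = log2(4), the uniform maximum
print(shannon_entropy([1.0, 0, 0, 0]))  # 0.0 bits, a certain outcome
```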

Applications:

  • Information Theory: Quantifies the average number of bits required to encode data
  • Ecology: Measures species diversity in ecosystems
  • Machine Learning: Underlies decision tree algorithms for assessing data split purity

For continuous distributions with probability density function $f(x)$, differential entropy is:

\[H(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx\]

Rényi Entropy

Rényi entropy generalizes Shannon entropy with parameter $\alpha \geq 0, \alpha \neq 1$, providing flexibility for adjusting sensitivity to rare versus common events:

\[H_\alpha(P) = \frac{1}{1-\alpha} \log \left(\sum_{i=1}^n p_i^\alpha\right)\]

Special cases include:

  • $H_0(P) = \log n$ (Hartley entropy - depends only on support size)
  • $H_1(P) = H(P)$ (Shannon entropy, taking the limit as $\alpha \to 1$)
  • $H_2(P) = -\log\left(\sum_{i=1}^n p_i^2\right)$ (collision entropy)
  • $H_\infty(P) = -\log(\max_i p_i)$ (min-entropy)
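
A short sketch covering the special cases above, handling the $\alpha = 1$ limit explicitly (the helper name is our own):

```python
import math

def renyi_entropy(p, alpha, base=2):
    """Rényi entropy H_alpha; alpha = 1 is treated as the Shannon limit."""
    p = [pi for pi in p if pi > 0]
    if math.isclose(alpha, 1.0):
        return -sum(pi * math.log(pi, base) for pi in p)  # Shannon entropy
    return math.log(sum(pi ** alpha for pi in p), base) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
print(renyi_entropy(p, 0))  # 2.0  = log2(4), Hartley entropy
print(renyi_entropy(p, 1))  # 1.75 bits, Shannon entropy
print(renyi_entropy(p, 2))  # ≈ 1.54, collision entropy
```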

Simpson’s Diversity Index

Originally from ecology, Simpson’s index measures the probability that two randomly selected individuals belong to different categories, and is especially sensitive to the dominance of common categories:

\[D = 1 - \sum_{i=1}^n p_i^2\]

The related Simpson’s concentration index is $\lambda = \sum_{i=1}^n p_i^2$. Lower values of $D$ indicate higher dominance and less diversity.

Gini-Simpson Index

Equivalent to Simpson’s diversity index but often used in different contexts:

\[\text{GS} = 1 - \sum_{i=1}^n p_i^2\]

This ranges from 0 (no diversity) to $1 - 1/n$ (maximum diversity).
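
Since Simpson’s diversity index and the Gini-Simpson index share the same formula, one small sketch covers both:

```python
def simpson_diversity(p):
    """Simpson's diversity D = 1 - sum(p_i^2); identical to the Gini-Simpson index."""
    return 1 - sum(pi ** 2 for pi in p)

print(simpson_diversity([0.25] * 4))                # 0.75 = 1 - 1/4, maximum for n = 4
print(simpson_diversity([0.97, 0.01, 0.01, 0.01]))  # ≈ 0.0588, one dominant category
```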

Gini Coefficient

The Gini coefficient measures inequality within a distribution, particularly useful in economics for income or wealth distributions:

\[G = \frac{1}{2n} \sum_{i=1}^n \sum_{j=1}^n |p_i - p_j|\]

A Gini coefficient of 0 indicates perfect equality; the maximum value for $n$ categories is $1 - 1/n$ (all mass on one category), which approaches 1 as $n$ grows.
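
A small sketch computing the Gini coefficient of a probability vector from the mean-absolute-difference form above:

```python
def gini(p):
    """Gini coefficient of a probability vector p (mean absolute difference form)."""
    n = len(p)
    mad = sum(abs(pi - pj) for pi in p for pj in p)
    # The general formula divides by 2 n^2 * mean; here mean = 1/n, so divide by 2n.
    return mad / (2 * n)

print(gini([0.25] * 4))      # 0.0, perfect equality
print(gini([1.0, 0, 0, 0]))  # 0.75 = 1 - 1/4, all mass on one category
```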

Tsallis Entropy

A one-parameter generalization with parameter $q \neq 1$:

\[S_q(P) = \frac{1}{q-1}\left(1 - \sum_{i=1}^n p_i^q\right)\]

As $q \to 1$, Tsallis entropy approaches Shannon entropy. Different values of $q$ emphasize different aspects of the distribution.
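
A brief sketch, checking numerically that $S_q$ approaches Shannon entropy (in nats) as $q \to 1$:

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy S_q of a discrete distribution."""
    p = [pi for pi in p if pi > 0]
    if math.isclose(q, 1.0):
        return -sum(pi * math.log(pi) for pi in p)  # Shannon limit, in nats
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

p = [0.5, 0.3, 0.2]
print(tsallis_entropy(p, 0.999))  # very close to the Shannon value below
print(tsallis_entropy(p, 1.0))    # ≈ 1.0297 nats (Shannon entropy)
```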

Hill Numbers (Effective Number of Species)

Hill numbers provide an intuitive interpretation by converting diversity measures into “effective number of equally abundant categories”, focusing on both species richness and abundance:

\[{}^qD = \left(\sum_{i=1}^n p_i^q\right)^{\frac{1}{1-q}}\]

Where:

  • $q = 0$: Species richness (total number of categories)
  • $q = 1$: $\exp(H(P))$ (exponential of Shannon entropy)
  • $q = 2$: $\frac{1}{\sum_{i=1}^n p_i^2}$ (inverse Simpson concentration)
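
The three cases above in one minimal sketch (the helper name is our own):

```python
import math

def hill_number(p, q):
    """Effective number of equally abundant categories, ^qD."""
    p = [pi for pi in p if pi > 0]
    if q == 0:
        return len(p)  # species richness
    if math.isclose(q, 1.0):
        return math.exp(-sum(pi * math.log(pi) for pi in p))  # exp of Shannon entropy
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

p = [0.7, 0.1, 0.1, 0.1]
print([round(hill_number(p, q), 3) for q in (0, 1, 2)])
# [4, 2.561, 1.923] -- the effective number shrinks as q weights common categories more
```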

Mutual Information

For joint distributions, mutual information measures dependence between variables $X$ and $Y$:

\[I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]

This can be expressed as $I(X;Y) = H(X) + H(Y) - H(X,Y)$, representing the reduction in uncertainty about one variable given knowledge of the other.
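
A minimal sketch computing $I(X;Y)$ from a joint probability table given as nested lists (function and variable names are our own):

```python
import math

def mutual_information(joint, base=2):
    """Mutual information I(X;Y) from a joint probability table, in bits by default."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]), base)
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0, independent variables
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0, Y fully determines X
```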

Comparison of Measures

While these diversity measures often overlap in their goals, they emphasize different properties:

  • Shannon Entropy: Suited for assessing randomness and uncertainty
  • Gini Coefficient: Ideal for quantifying inequality in wealth or resource distributions
  • Simpson’s Index: Emphasizes the dominance of a few categories in a distribution
  • Rényi Entropy and Hill Numbers: Offer flexibility for adjusting sensitivity to rare events

The choice of measure depends on the context and the specific aspects of diversity or randomness being analyzed.

2. Primitive Roots

A primitive root modulo a prime $p$ is a fundamental concept in number theory that characterizes generators of the multiplicative group modulo $p$, with important applications in cryptography and discrete mathematics.

Definition

Let $p$ be a prime number. An integer $g$ is called a primitive root modulo $p$ if for every integer $a$ that is coprime to $p$ (i.e., $\gcd(a,p) = 1$), there exists an integer $k$ such that:

\[g^k \equiv a \pmod{p}\]

Equivalent Characterizations

A primitive root $g$ modulo prime $p$ can be equivalently defined as:

  1. Order characterization: $g$ has multiplicative order $p-1$ modulo $p$, meaning $(p-1)$ is the smallest positive integer $d$ such that $g^d \equiv 1 \pmod{p}$.
  2. Generator characterization: $g$ generates the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$, which has order $\phi(p) = p-1$.
  3. Exponential characterization: The set $\{g^0, g^1, g^2, \ldots, g^{p-2}\}$ modulo $p$ equals $\{1, 2, 3, \ldots, p-1\}$.

Key Properties

Existence: Every prime $p$ has primitive roots. In fact, there are exactly $\phi(p-1)$ primitive roots modulo $p$, where $\phi$ is Euler’s totient function.

Structure: If $g$ is a primitive root modulo $p$, then $g^k$ is also a primitive root modulo $p$ if and only if $\gcd(k, p-1) = 1$.

Discrete logarithm: If $g$ is a primitive root modulo $p$, then for any $a \not\equiv 0 \pmod{p}$, the unique integer $k$ with $0 \leq k \leq p-2$ such that $g^k \equiv a \pmod{p}$ is called the discrete logarithm of $a$ to base $g$ modulo $p$, denoted $\log_g a$.

Examples

Modulo 5: Consider $p = 5$. The multiplicative group is $\{1, 2, 3, 4\}$ with order $4$.

  • $g = 2$: Powers are $2^1 \equiv 2$, $2^2 \equiv 4$, $2^3 \equiv 3$, $2^4 \equiv 1 \pmod{5}$
  • Since $\{2^1, 2^2, 2^3, 2^4\} = \{2, 4, 3, 1\} = \{1, 2, 3, 4\}$, we have that $2$ is a primitive root modulo $5$.

Modulo 7: Consider $p = 7$. The multiplicative group is $\{1, 2, 3, 4, 5, 6\}$ with order $6$.

  • $g = 3$: Powers are $3^1 \equiv 3$, $3^2 \equiv 2$, $3^3 \equiv 6$, $3^4 \equiv 4$, $3^5 \equiv 5$, $3^6 \equiv 1 \pmod{7}$
  • Since all non-zero residues appear, $3$ is a primitive root modulo $7$.
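
Both examples can be reproduced in a couple of lines of Python using the built-in three-argument `pow` for modular exponentiation:

```python
def powers_mod(g, p):
    """Successive powers g^1, ..., g^(p-1) reduced modulo p."""
    return [pow(g, k, p) for k in range(1, p)]

print(powers_mod(2, 5))  # [2, 4, 3, 1]       -> hits every nonzero residue mod 5
print(powers_mod(3, 7))  # [3, 2, 6, 4, 5, 1] -> hits every nonzero residue mod 7
print(powers_mod(2, 7))  # [2, 4, 1, 2, 4, 1] -> order 3, so 2 is not a primitive root mod 7
```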

Finding Primitive Roots

To verify if $g$ is a primitive root modulo prime $p$:

  1. Check that $g^{(p-1)/q} \not\equiv 1 \pmod{p}$ for every prime divisor $q$ of $p-1$.
  2. Equivalently, verify that the order of $g$ is exactly $p-1$.
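
A sketch of the test in step 1, with a simple trial-division factorization that is adequate for small $p$ (helper names are our own):

```python
def prime_factors(n):
    """Distinct prime factors of n by trial division (fine for small n)."""
    factors, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors

def is_primitive_root(g, p):
    """Check g^((p-1)/q) != 1 (mod p) for every prime q dividing p - 1."""
    return all(pow(g, (p - 1) // q, p) != 1 for q in prime_factors(p - 1))

print([g for g in range(1, 7) if is_primitive_root(g, 7)])  # [3, 5]: phi(6) = 2 roots
```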

Applications

Cryptography: Primitive roots are crucial in:

  • Diffie-Hellman key exchange protocols
  • ElGamal encryption schemes
  • Digital signature algorithms

The security of these schemes relies on the difficulty of solving the discrete logarithm problem.

Quadratic residues: If $g$ is a primitive root modulo odd prime $p$, then $a$ is a quadratic residue modulo $p$ if and only if the discrete logarithm $\log_g a$ is even.
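
A quick numerical check of this fact for $p = 7$ and primitive root $g = 3$, comparing the parity of $\log_3 a$ against Euler’s criterion $a^{(p-1)/2} \equiv 1 \pmod{p}$:

```python
p, g = 7, 3
logs = {pow(g, k, p): k for k in range(p - 1)}  # discrete log table: a -> log_g(a)
for a in sorted(logs):
    even_log = logs[a] % 2 == 0
    is_qr = pow(a, (p - 1) // 2, p) == 1        # Euler's criterion for quadratic residues
    print(a, logs[a], even_log == is_qr)        # prints True for every a
```
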
Random number generation: Primitive roots are used in algorithms for generating pseudo-random numbers in modular arithmetic systems.

Signal processing: In finite fields, primitive roots are used in designing efficient algorithms for fast Fourier transforms.

The concept extends beyond primes to other moduli, but the structure becomes more complex. For prime powers $p^k$ and certain composite numbers, primitive roots may exist, but for general composite numbers, the multiplicative group structure is more intricate.

Practice

HW8 - Statistical Analysis
