Homework 3
Median Minimizes the Sum of Absolute Deviations
Let \(S = \{S_1, S_2, \dots, S_n\}\) be a set of real numbers, sorted in increasing order so that \(S_1 \leq S_2 \leq \dots \leq S_n\).
We wish to show that the median of this set minimizes the sum of absolute deviations from any \(x \in \mathbb{R}\). Formally, we want to minimize the objective function
\[f(x) = \sum_{i=1}^{n} |x - S_i|.\]
Proof
Pairing Elements
Let us start by considering the simplest case with two elements, \(S_1\) and \(S_n\). The objective becomes:
\[f(x) = |x - S_1| + |x - S_n|\]
Case 1: \(x < S_1\)
If \(x\) is smaller than both \(S_1\) and \(S_n\), the distances are:
\[f(x) = -(x - S_1) - (x - S_n) = S_1 + S_n - 2x\]
As \(x\) decreases, the sum of absolute deviations increases, so within this region \(f(x)\) is smallest as \(x\) approaches \(S_1\).
Case 2: \(S_1 \leq x \leq S_n\)
In this case, the absolute values reduce to:
\[f(x) = (x - S_1) + (S_n - x) = S_n - S_1\]
This expression is constant, meaning \(f(x)\) is minimized for any \(x \in [S_1, S_n]\).
Case 3: \(x > S_n\)
If \(x\) is larger than both \(S_1\) and \(S_n\), the distances are:
\[f(x) = (x - S_1) + (x - S_n) = 2x - S_1 - S_n\]
As \(x\) increases, the sum of absolute deviations increases, so within this region \(f(x)\) is smallest as \(x\) approaches \(S_n\).
General Case: \(n\) Elements
Now consider the general case where \(S = \{S_1, S_2, \dots, S_n\}\). We proceed by recursively pairing the smallest and largest elements.
- Pairing the smallest and largest elements: Consider the pair \(S_1\) and \(S_n\). From the analysis of two elements, we know that the sum of absolute deviations for this pair is minimized when \(S_1 \leq x \leq S_n\).
- Remove the paired elements: After determining that \(S_1 \leq x \leq S_n\) minimizes the deviations for \(S_1\) and \(S_n\), we remove these elements from the set and repeat the process with the next smallest and largest pair, \(S_2\) and \(S_{n-1}\).
- Continue the process: We continue this process of pairing and removing elements until we are left with at most one element. In each step, we narrow the range in which \(x\) must lie to minimize the sum of absolute deviations.
- Final element or pair:
- If \(n\) is odd, the final step will leave one element \(S_k\), which is the median, and \(x = S_k\) will trivially minimize the sum of deviations for this element.
- If \(n\) is even, the process ends with no elements remaining, but in each prior step, we constrained \(x\) to lie between paired elements.
In particular, the last (innermost) pair constrains \(x\) to lie between \(S_{\frac{n}{2}}\) and \(S_{\frac{n}{2}+1}\), which means the median (or any value between these two) minimizes the sum of absolute deviations.
Thus, in both the odd and even cases, the median minimizes the sum of absolute deviations: the median always lies inside every interval produced as we progressively pair off the set, and in the even case the sum is constant for any \(x \in [S_{\frac{n}{2}}, S_{\frac{n}{2}+1}]\).
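To make the result concrete, here is a small numerical sanity check (a sketch, not part of the proof): it evaluates \(f(x) = \sum_{i=1}^{n} |x - S_i|\) on a grid of candidate values and confirms that the minimum is attained at the median. The helper name sum_abs_dev, the sample data, and the grid resolution are illustrative choices.

def sum_abs_dev(x, data):
    # f(x) = sum of absolute deviations of the data from x
    return sum(abs(x - s) for s in data)

data = [1, 3, 7, 10, 25]                # n = 5 is odd, so the median is 7
grid = [i / 10 for i in range(0, 301)]  # candidate values of x in [0, 30]
best = min(grid, key=lambda x: sum_abs_dev(x, data))
print(best)                             # 7.0, i.e. the median

With an even number of elements, every grid point between the two middle values attains the same minimal sum, matching the interval \([S_{\frac{n}{2}}, S_{\frac{n}{2}+1}]\) found above.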
Location or Central Tendency
Defining a “location” statistic (or central tendency) for a distribution involves selecting a representative value that characterizes the center or typical value of the data.
There are many ways to define central tendency, and the concept can be generalized in countless ways depending on the assumptions and properties of the data or the specific context.
- Mean: The sum of all values divided by the total number of values. It is also called the “Arithmetic Mean” or “Arithmetic Average”.
def mean(data):
n = len(data)
return sum(data)/n
- Median: The middle value when the data is arranged in ascending order. If the data has an even number of values, the median is the average of the two middle values.
def median(data):
    data = sorted(data)  # arrange the data in ascending order
    n = len(data)
    if n % 2 == 0:  # If the data has an even number of values
        median_value = (data[n//2 - 1] + data[n//2]) / 2
    else:  # If the data has an odd number of values
        median_value = data[n//2]
    return median_value
- Mode: The value that appears most frequently in the dataset.
def find_mode(ls):
    dic = {}
    max_count = 0
    mode = None
    for val in ls:
        dic[val] = dic.get(val, 0) + 1  # count occurrences of each value
        if dic[val] > max_count:
            max_count = dic[val]
            mode = val
    return mode
The mode can be generalized to multimodal distributions, where there can be multiple modes, or to kernel density estimations that define modes over continuous distributions.
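As a small illustration of the multimodal case (a sketch; the helper find_modes and the sample list are illustrative, not from the original), the following returns every value tied for the highest count:

from collections import Counter

def find_modes(ls):
    # Count occurrences and return all values that attain the maximum count.
    counts = Counter(ls)
    max_count = max(counts.values())
    return [val for val, c in counts.items() if c == max_count]

print(find_modes([1, 2, 2, 3, 3, 4]))  # [2, 3]: a bimodal sample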
- Geometric Mean: The geometric mean is calculated by multiplying all the values together and then taking the n-th root of the product, where n is the number of values.
def geometricMean(ls):
    n = len(ls)
    product = 1
    for val in ls:
        product *= val
    # n-th root of the product of all values
    return product ** (1 / n)
It is used for multiplicative relationships and can be generalized to power means, to products with different exponents, or to transformations of the data.
- Harmonic Mean: The harmonic mean is calculated by dividing the total number of values in the set by the sum of the reciprocals of the values.
def harmonicMean(ls):
    somma = 0
    for val in ls:
        somma += 1 / val  # accumulate the sum of reciprocals
    return len(ls) / somma
Harmonic means are useful when the data involves rates. These can be generalized to weighted harmonic means or combined with other methods, such as using higher-order reciprocals.
- M-Estimators: A robust class of estimators that generalizes the mean by minimizing a loss function.
Solve for \(\hat{\theta}\) that minimizes \(\sum_{i=1}^{n} \rho(x_i - \theta)\), where \(\rho\) is a chosen function.
Different choices of \(\rho\) (e.g., squared loss, absolute loss, Huber loss) yield different central tendencies.
These estimators can be further generalized by using complex or adaptive loss functions.
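As a sketch of the idea (an illustration under assumed choices, not a prescribed implementation), the snippet below takes \(\rho\) to be the Huber loss and minimizes \(\sum_{i=1}^{n} \rho(x_i - \theta)\) by ternary search, which applies because the objective is convex in \(\theta\); the threshold delta and the helper names are illustrative.

def huber(r, delta=1.0):
    # Huber loss: quadratic near zero, linear in the tails (robust to outliers).
    return 0.5 * r * r if abs(r) <= delta else delta * (abs(r) - 0.5 * delta)

def m_estimate(data, delta=1.0, iters=200):
    # Minimize sum_i rho(x_i - theta) over theta by ternary search;
    # the Huber objective is convex, so the search converges to the minimizer.
    lo, hi = min(data), max(data)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if sum(huber(x - m1, delta) for x in data) < sum(huber(x - m2, delta) for x in data):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

print(m_estimate([1, 3, 7, 10, 25]))  # ~7.0: stays near the median despite the outlier 25

A larger delta makes the estimate behave more like the mean, while a smaller delta makes it behave more like the median.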
Infinite Potential Generalizations:
- Generalized Mean (Power Mean): A family of means defined by \(M_p = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^p\right)^{1/p}\).
Varying the parameter \(p\) produces different types of means:
- p = 1: Arithmetic mean
- p = 0: Geometric mean (as the limit \(p \to 0\))
- p = -1: Harmonic mean
This family can be extended to negative or fractional values of \(p\), each representing a different concept of centrality.
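A minimal sketch of this family, treating \(p = 0\) as the geometric-mean limit (the function name power_mean and the sample values are illustrative):

from math import exp, log

def power_mean(data, p):
    # Generalized (power) mean: M_p = ((1/n) * sum of x**p) ** (1/p).
    n = len(data)
    if p == 0:
        # The p -> 0 limit is the geometric mean, computed via logarithms.
        return exp(sum(log(x) for x in data) / n)
    return (sum(x ** p for x in data) / n) ** (1 / p)

values = [1, 2, 4, 8]
print(power_mean(values, 1))   # 3.75   (arithmetic mean)
print(power_mean(values, 0))   # ~2.83  (geometric mean)
print(power_mean(values, -1))  # ~2.13  (harmonic mean)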
- Choosing different norms: Using norms other than the Euclidean norm allows you to define different means in vector spaces, such as \(L^p\)-norms for generalized means.
- Changing distance metrics: Using alternative distance metrics (e.g., Manhattan distance) in M-estimator calculations leads to new definitions of central tendency.
Practice
HW3 - Stochastic Process Modeling
References
- https://math.stackexchange.com/questions/113270/the-median-minimizes-the-sum-of-absolute-deviations-the-ell-1-norm