Homework 1
What is Statistics?
Statistics is a branch of mathematics that involves collecting, organizing, analyzing, interpreting, and presenting data. The goal of statistics is to extract meaningful insights and draw valid conclusions from data.
Basic Notions in Statistics
- Population: in statistics, the population is the complete set of all elements (units) being studied; it encompasses the measurements of every unit for which data is accessible.
- Statistical Units: A unit refers to an individual object or person whose attributes are being examined. For each unit, a variety of selected attributes are examined based on the research focus.
- Distribution: refers to the way in which data values are spread or arranged across a given range. It provides a summary of how often different values or ranges of values occur in a dataset. The distribution helps identify patterns, trends, and characteristics of the data, and it is crucial for statistical analysis and interpretation.
Averages in Statistics
In statistics, we use specific measures to calculate averages, and these are collectively called measures of central tendency. The most common measures of central tendency are:
- Mean: The sum of all values divided by the total number of values. It is also called the “Arithmetic Mean” or “Arithmetic Average”.
def mean(data):
    n = len(data)
    return sum(data) / n
- Median: The middle value when the data is arranged in ascending order.
If the data has an even number of values, the median is the average of the two middle values.
def median(data):
    data = sorted(data)  # the median requires the data in ascending order
    n = len(data)
    if n % 2 == 0:
        # even number of values: average the two middle values
        return (data[n//2 - 1] + data[n//2]) / 2
    # odd number of values: take the single middle value
    return data[n//2]
- Mode: The value that appears most frequently in the dataset.
def find_mode(ls):
    counts = {}
    max_count = 0
    mode = None
    for val in ls:
        counts[val] = counts.get(val, 0) + 1  # missing keys start at 0
        if counts[val] > max_count:
            max_count = counts[val]
            mode = val
    return mode
- Geometric Mean: The geometric mean is calculated by multiplying all the values together and taking the n-th root of the product, where n is the number of values.
def geometricMean(ls):
    n = len(ls)
    product = 1
    for val in ls:
        product *= val
    return product ** (1 / n)
- Harmonic Mean: The harmonic mean is calculated by dividing the total number of values in the set by the sum of the reciprocals of the values.
def harmonicMean(ls):
    somma = 0
    for val in ls:
        somma += 1 / val
    return len(ls) / somma
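As a quick sanity check on the three means above, a small self-contained sketch (the data values are arbitrary): for strictly positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.

```python
import math

data = [2.0, 4.0, 8.0]
n = len(data)
am = sum(data) / n                 # arithmetic mean
gm = math.prod(data) ** (1 / n)    # geometric mean
hm = n / sum(1 / x for x in data)  # harmonic mean
# for strictly positive data: hm <= gm <= am
```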
Floating Point Representation: Errors and Numerical Stability
For statisticians, recognizing errors in computation is vital for ensuring the accuracy and reliability of their analyses.
Floating-point representation inherently involves approximating real numbers, which leads to unavoidable inaccuracies. These errors arise from the limitation of using a finite number of bits to represent an infinite range of real numbers.
The concept of significant figures, or more accurately, relative error, is key to understanding the behavior of floating-point arithmetic. Relative error is defined as:
\[relativeError = \frac{approximateValue - trueValue}{trueValue}\]
This measure quantifies the inaccuracy in the representation of a number.
In floating-point arithmetic, due to limited precision, small errors can accumulate and potentially grow as more operations are performed.
Floating-point arithmetic can encounter specific problems:
- Rounding Errors: Rounding errors arise from the limited precision in representing real numbers with finite bits, preventing exact representation of certain values like 0.1 in binary format. The decimal number 0.1 cannot be represented exactly in binary because its binary equivalent results in an infinite repeating fraction.
Due to the limited number of bits used in floating-point representation, this repeating pattern cannot be stored precisely.
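A short sketch makes this concrete: constructing Python's `fractions.Fraction` from a float recovers the exact binary value that was actually stored, so the relative error of the stored 0.1 can be computed exactly.

```python
from fractions import Fraction

# Fraction(0.1) recovers the exact binary value stored for the literal 0.1
stored = Fraction(0.1)
exact = Fraction(1, 10)
rel_err = abs(stored - exact) / exact
# rel_err is tiny but nonzero, on the order of the unit roundoff (~1e-16)
```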
- Propagation of Errors: In iterative algorithms, minor errors can accumulate significantly, affecting accuracy. Conducting stability analysis is crucial to understand how errors propagate and assess an algorithm’s reliability in handling small input variations.
- Subtraction of nearly equal values: subtracting two nearly equal values can greatly amplify the relative error. This phenomenon is known as catastrophic cancellation. Consider the difference \(x = a - b\) of two nearly equal numbers \(a\) and \(b\). If both numbers are rounded in the floating-point representation, \(x\) may end up being 0 or a value very close to 0, resulting in a significant loss of precision in the final result.
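A minimal sketch of catastrophic cancellation (the function names and the test value are illustrative, not from the text): computing \(\sqrt{x+1} - \sqrt{x}\) directly subtracts two nearly equal numbers, while the algebraically equivalent form \(1 / (\sqrt{x+1} + \sqrt{x})\) avoids the subtraction entirely.

```python
import math

def naive_diff(x):
    # subtracts two nearly equal square roots: catastrophic cancellation
    return math.sqrt(x + 1) - math.sqrt(x)

def stable_diff(x):
    # algebraically equivalent form that avoids the subtraction
    return 1 / (math.sqrt(x + 1) + math.sqrt(x))

# for large x the naive form keeps far fewer correct digits
x = 1e12
```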
- Overflow & Underflow: Floating-point formats limit the exponent range, causing overflow (when a result exceeds the largest representable magnitude) and underflow (when a nonzero result is smaller than the smallest normal magnitude). Special values like "Inf" and "NaN" represent overflow and invalid results, while subnormal (denormalized) numbers allow gradual underflow toward zero.
An algorithm is considered numerically stable if the small errors introduced during computation do not grow to distort the final result. In other words, stable algorithms yield accurate results even when small intermediate errors occur.
In Volume 2 of his book series "The Art of Computer Programming" (Seminumerical Algorithms), Donald Knuth offers a thorough discussion of the challenges of numerical computation and of methods for reducing errors, in the chapter titled "Arithmetic."
Donald Knuth provides various suggestions for enhancing the numerical stability of algorithms:
- Use correct rounding at critical points to reduce errors.
- Prefer tolerance-based comparisons over strict equality when dealing with floating-point numbers.
- Avoid premature rounding during calculations to prevent error accumulation.
- Avoid Catastrophic Cancellation: reformulate expressions to minimize the risk of subtracting nearly equal numbers. For instance, when \(a \approx b\), expressing \(a - b\) as \(\frac{a^2 - b^2}{a + b}\) can aid in maintaining precision.
- Exploiting Mathematical Properties: Knuth emphasizes the importance of using simple mathematical laws to enhance the stability of floating-point operations by exploiting properties like symmetry and regularity. One notable method is rounding to even, which helps minimize bias and error accumulation in both binary and decimal systems.
- Empirical Testing: Knuth emphasizes the need for empirical testing and validation, which includes stress testing algorithms with a variety of inputs to identify weaknesses and comparing results from different methods to ensure accuracy and stability.
- Maximize precision during intermediate steps to minimize rounding issues. For example, the Kahan Summation Algorithm helps minimize the rounding errors that occur when adding a series of numbers: the loop iterates through each element of the input list, compensating for any loss of precision along the way.
def kahan_sum(ls):
    somma = 0.0
    c = 0.0  # the compensation for lost low-order bits
    n = len(ls)
    for i in range(n):
        y = ls[i] - c        # apply the compensation to the next term
        t = somma + y        # low-order bits of y may be lost here
        c = (t - somma) - y  # recover the lost part (with opposite sign)
        somma = t
    return somma
Other techniques such as unnormalized arithmetic and interval arithmetic can help manage errors, but they must be applied with care.
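Two of the suggestions above can be observed directly in Python, which implements round-half-to-even in the built-in `round()` and offers `math.isclose` for tolerance-based comparisons (a small sketch):

```python
import math

# round-half-to-even ("banker's rounding") minimizes bias: halves go to
# the nearest even integer rather than always rounding up
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2

# prefer tolerance-based comparison over strict equality
assert 0.1 + 0.2 != 0.3              # strict equality fails
assert math.isclose(0.1 + 0.2, 0.3)  # tolerance-based comparison succeeds
```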
Practice: Cyber Attack Simulation
The goal of this simulation is to illustrate the distribution and statistical behavior of an attack, where multiple hackers, each with an equal probability, have the potential to compromise a server. The simulation models this scenario across n servers, m hackers, and a probability p that a hacker successfully penetrates a server.
The simulation generates two distinct graphs:
- First graph: Each line represents an individual hacker. The y-axis displays the number of servers violated, while the x-axis represents time. Each time a hacker successfully compromises a server, that hacker's line steps up by 1 (a successful violation); otherwise it stays flat (no violation).
- Second graph: A histogram depicting the distribution of violations at the conclusion of the simulation. It shows how many hackers ended up with a final breach count in each range of servers, giving an overview of the final distribution of server compromises.
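The process behind both graphs can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are not from the online implementation): each hacker attempts every server independently with probability p, so a hacker's final count is a draw from a Binomial(n, p) distribution, and each row below is the cumulative trajectory plotted in the first graph.

```python
import random

def simulate(n_servers, n_hackers, p):
    # one row per hacker: cumulative number of servers breached over time
    runs = []
    for _ in range(n_hackers):
        total, row = 0, []
        for _ in range(n_servers):
            if random.random() < p:  # successful breach with probability p
                total += 1
            row.append(total)
        runs.append(row)
    return runs

runs = simulate(n_servers=20, n_hackers=5, p=0.5)
final_counts = [row[-1] for row in runs]  # the data behind the histogram
```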
Online Implementation
Main Function
function generateGraphs() {
    const n_servers = parseInt(document.getElementById('n').value);
    const n_hackers = parseInt(document.getElementById('m').value);
    const p = parseFloat(document.getElementById('p').value);

    // Validate input
    if (n_servers < 2 || n_servers > 500 || n_hackers < 1 || n_hackers > 500 || p < 0 || p > 1) {
        alert('Please enter values in these intervals:\n- 2-500 for servers\n- 1-500 for hackers\n- 0-1 for probability.');
        return;
    }

    const servers = [];
    for (let i = 1; i <= n_servers; i++) { servers.push(i); }
    const hackers = [];
    for (let i = 1; i <= n_hackers; i++) { hackers.push(i); }

    const breachDatasets = getDataset(hackers.length, servers.length, p);

    // Wider histogram bins for larger server counts;
    // test the largest threshold first so every branch is reachable
    const serversCount = servers.length;
    let intervalSize = 1;
    if (serversCount >= 400) {
        intervalSize = 8;
    } else if (serversCount >= 300) {
        intervalSize = 7;
    } else if (serversCount >= 200) {
        intervalSize = 6;
    } else if (serversCount >= 100) {
        intervalSize = 5;
    } else if (serversCount >= 50) {
        intervalSize = 4;
    } else if (serversCount >= 20) {
        intervalSize = 3;
    } else if (serversCount >= 10) {
        intervalSize = 2;
    }

    const intervalCount = Math.ceil(serversCount / intervalSize);
    const intervals = Array.from({ length: intervalCount }, (_, i) => `${i * intervalSize}-${(i + 1) * intervalSize}`);

    const distributionCounts = Array(intervalCount).fill(0);
    breachDatasets.forEach(dataset => {
        const index = Math.floor(dataset.data.at(-1) / intervalSize);
        if (index >= 0 && index < intervalCount) { distributionCounts[index]++; }
    });

    // Trim empty leading and trailing bins
    while (distributionCounts[0] === 0) {
        distributionCounts.shift();
        intervals.shift();
    }
    while (distributionCounts.at(-1) === 0) {
        distributionCounts.pop();
        intervals.pop();
    }

    plotFlatLineChart(servers, hackers, breachDatasets);
    plotHistogram(intervals, distributionCounts);
    document.getElementById('charts').style.display = 'flex';
}
- Input Parsing: The function starts by retrieving the values from the HTML input fields for `n_servers` (number of servers), `n_hackers` (number of hackers), and `p` (probability of a successful breach). These values are parsed and stored in variables:
const n_servers = parseInt(document.getElementById('n').value);
const n_hackers = parseInt(document.getElementById('m').value);
const p = parseFloat(document.getElementById('p').value);
- Input Validation: It checks whether the input values are within the acceptable ranges. If `n_servers` or `n_hackers` fall outside their ranges (2 to 500 for servers, 1 to 500 for hackers), or if `p` is not between 0 and 1, the function issues an alert and exits:
if (n_servers < 2 || n_servers > 500 || n_hackers < 1 || n_hackers > 500 || p < 0 || p > 1) {
    alert('Please enter values in these intervals:\n- 2-500 for servers\n- 1-500 for hackers\n- 0-1 for probability.');
    return;
}
- Array Population: It creates two arrays, `servers` and `hackers`, each populated with sequential numbers representing server and hacker IDs:
const servers = [];
for (let i = 1; i <= n_servers; i++) { servers.push(i); }
const hackers = [];
for (let i = 1; i <= n_hackers; i++) { hackers.push(i); }
- Dataset Generation: The function then calls `getDataset()` to generate a dataset of simulated breaches, passing the number of hackers, the number of servers, and the probability `p` as arguments. The resulting dataset is stored in `breachDatasets`:
const breachDatasets = getDataset(hackers.length, servers.length, p);
- Interval Size Determination: The size of the histogram bins (`intervalSize`) depends on the number of servers; larger counts use wider intervals to keep the visualization meaningful (the largest threshold is tested first so that every branch is reachable):
const serversCount = servers.length;
let intervalSize = 1;
if (serversCount >= 400) { intervalSize = 8; }
else if (serversCount >= 300) { intervalSize = 7; }
else if (serversCount >= 200) { intervalSize = 6; }
else if (serversCount >= 100) { intervalSize = 5; }
else if (serversCount >= 50) { intervalSize = 4; }
else if (serversCount >= 20) { intervalSize = 3; }
else if (serversCount >= 10) { intervalSize = 2; }
- Interval Count Calculation: The number of intervals is obtained by dividing `serversCount` by `intervalSize`, which determines how many breach ranges the histogram covers. An array `intervals` is created, with each entry holding the lower and upper bounds of one interval:
const intervalCount = Math.ceil(serversCount / intervalSize);
const intervals = Array.from({ length: intervalCount }, (_, i) => `${i * intervalSize}-${(i + 1) * intervalSize}`);
- Data Binning: The function initializes an array `distributionCounts` to store the number of hackers whose final breach count falls within each interval. It then iterates through `breachDatasets`, determines which interval each final value belongs to, and increments the count for that interval:
const distributionCounts = Array(intervalCount).fill(0);
breachDatasets.forEach(dataset => {
    const index = Math.floor(dataset.data.at(-1) / intervalSize);
    if (index >= 0 && index < intervalCount) { distributionCounts[index]++; }
});
- Empty Interval Removal: If any leading or trailing intervals in `distributionCounts` are empty (i.e., contain zero hackers), they are removed, along with their labels in `intervals`, to avoid displaying unnecessary empty bins in the histogram:
while (distributionCounts[0] === 0) {
    distributionCounts.shift();
    intervals.shift();
}
while (distributionCounts.at(-1) === 0) {
    distributionCounts.pop();
    intervals.pop();
}
- Graph Plotting: The function finally calls two plotting functions, `plotFlatLineChart()` and `plotHistogram()`, to visualize the data. It passes the `servers`, `hackers`, and `breachDatasets` arrays to the first function and the `intervals` and `distributionCounts` arrays to the second. The results are displayed in the DOM element with ID `charts`:
plotFlatLineChart(servers, hackers, breachDatasets);
plotHistogram(intervals, distributionCounts);
document.getElementById('charts').style.display = 'flex';
Dataset Generation
function getDataset(hackers, servers, p) {
    let dataset = [];
    for (let i = 1; i <= hackers; i++) {
        let hackerInfo = [];
        let top = 0;
        for (let j = 0; j < servers; j++) {
            let x = top;
            if (p > Math.random()) { x = ++top; }
            hackerInfo.push(x);
        }
        dataset.push({
            data: hackerInfo,
            label: `Hacker ${i}`,
            borderColor: generateRandomColor(),
        });
    }
    return dataset;
}
- The function `getDataset` takes three parameters: `hackers`, `servers`, and `p`.
- An empty array `dataset` is initialized to hold the generated data.
- A loop runs for each hacker, creating an empty array `hackerInfo` for each one.
- The variable `top` is initialized to 0 to track "successes."
- For each hacker, another loop iterates over the servers.
- In each iteration, `p` is compared to a random value generated by `Math.random()`.
- If `p` is greater than the random value, `top` is incremented.
- Each result (`x`) is stored in the `hackerInfo` array.
- After the loop for a hacker, the data is pushed into `dataset`, with a label and a color.
- The function returns the `dataset` array containing the information for all hackers.
References
- Knuth, Donald E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, 1997.
- Mustafy, Tanvir, and Md. Tauhid Ur Rahman. Statistics and Data Analysis for Engineers and Scientists. Springer, 2024.