Homework 1
What is Statistics?
Statistics is a branch of mathematics that involves collecting, organizing, analyzing, interpreting, and presenting data. The goal of statistics is to extract meaningful insights and draw valid conclusions from data.
Basic Notions in Statistics
- Population: in statistics, the population is the complete set of all elements (units) being studied; it encompasses the measurements of every unit for which data is accessible.
- Statistical Units: A unit refers to an individual object or person whose attributes are being examined. For each unit, a variety of selected attributes are examined based on the research focus.
- Distribution: refers to the way in which data values are spread or arranged across a given range. It provides a summary of how often different values or ranges of values occur in a dataset. The distribution helps identify patterns, trends, and characteristics of the data, and it is crucial for statistical analysis and interpretation.
Averages in Statistics
In statistics, we use specific measures to calculate averages, and these are collectively called measures of central tendency. The most common measures of central tendency are:
- Mean: The sum of all values divided by the total number of values. It is also called the “Arithmetic Mean” or “Arithmetic Average”.
def mean(data):
    n = len(data)
    return sum(data) / n
- Median: The middle value when the data is arranged in ascending order.
If the data has an even number of values, the median is the average of the two middle values.
def median(data):
    data = sorted(data)  # the median requires the data in ascending order
    n = len(data)
    if n % 2 == 0:
        # even number of values: average the two middle values
        return (data[n//2 - 1] + data[n//2]) / 2
    # odd number of values: take the single middle value
    return data[n//2]
- Mode: The value that appears most frequently in the dataset.
def find_mode(ls):
    counts = {}
    max_count = 0
    mode = None
    for val in ls:
        counts[val] = counts.get(val, 0) + 1  # missing keys start at 0
        if counts[val] > max_count:
            max_count = counts[val]
            mode = val
    return mode
- Geometric Mean: The geometric mean is calculated by multiplying all the values together and taking the n-th root of the product, where n is the number of values.
def geometricMean(ls):
    n = len(ls)
    product = 1
    for val in ls:
        product *= val
    return product ** (1 / n)
- Harmonic Mean: The harmonic mean is calculated by dividing the total number of values in the set by the sum of the reciprocals of the values.
def harmonicMean(ls):
    somma = 0
    for val in ls:
        somma += 1 / val
    return len(ls) / somma
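As a quick sanity check on the three means above, a small self-contained sketch (the data values are arbitrary): for strictly positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.

```python
import math

data = [2.0, 4.0, 8.0]
n = len(data)
am = sum(data) / n                 # arithmetic mean
gm = math.prod(data) ** (1 / n)    # geometric mean
hm = n / sum(1 / x for x in data)  # harmonic mean
# for strictly positive data: hm <= gm <= am
```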
Floating Point Representation: Errors and Numerical Stability
For statisticians, recognizing errors in computation is vital for ensuring the accuracy and reliability of their analyses.
Floating-point representation inherently involves approximating real numbers, which leads to unavoidable inaccuracies. These errors arise from the limitation of using a finite number of bits to represent an infinite range of real numbers.
The concept of significant figures, or more accurately, relative error, is key to understanding the behavior of floating-point arithmetic. Relative error is defined as:
\[relativeError = \frac{approximateValue - trueValue}{trueValue}\]
This measure quantifies the inaccuracy in the representation of a number.
In floating-point arithmetic, due to limited precision, small errors can accumulate and potentially grow as more operations are performed.
Floating-point arithmetic can encounter specific problems:
- Rounding Errors: Rounding errors arise from the limited precision in representing real numbers with finite bits, preventing exact representation of certain values like 0.1 in binary format. The decimal number 0.1 cannot be represented exactly in binary because its binary equivalent results in an infinite repeating fraction.
Due to the limited number of bits used in floating-point representation, this repeating pattern cannot be stored precisely.
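A short sketch makes this concrete: constructing Python's `fractions.Fraction` from a float recovers the exact binary value that was actually stored, so the relative error of the stored 0.1 can be computed exactly.

```python
from fractions import Fraction

# Fraction(0.1) recovers the exact binary value stored for the literal 0.1
stored = Fraction(0.1)
exact = Fraction(1, 10)
rel_err = abs(stored - exact) / exact
# rel_err is tiny but nonzero, on the order of the unit roundoff (~1e-16)
```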
- Propagation of Errors: In iterative algorithms, minor errors can accumulate significantly, affecting accuracy. Conducting stability analysis is crucial to understand how errors propagate and assess an algorithm’s reliability in handling small input variations.
- Subtraction of nearly equal values: subtracting two nearly equal values can greatly amplify the relative error. This phenomenon is known as catastrophic cancellation. Consider the difference \(x = a - b\) of two nearly equal numbers \(a\) and \(b\). If both numbers are rounded in the floating-point representation, \(x\) may end up being 0 or a value very close to 0, resulting in a significant loss of precision in the final result.
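A minimal sketch of catastrophic cancellation (the function names and the test value are illustrative, not from the text): computing \(\sqrt{x+1} - \sqrt{x}\) directly subtracts two nearly equal numbers, while the algebraically equivalent form \(1 / (\sqrt{x+1} + \sqrt{x})\) avoids the subtraction entirely.

```python
import math

def naive_diff(x):
    # subtracts two nearly equal square roots: catastrophic cancellation
    return math.sqrt(x + 1) - math.sqrt(x)

def stable_diff(x):
    # algebraically equivalent form that avoids the subtraction
    return 1 / (math.sqrt(x + 1) + math.sqrt(x))

# for large x the naive form keeps far fewer correct digits
x = 1e12
```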
- Overflow & Underflow: Floating-point formats limit the exponent range, causing overflow (when a result exceeds the largest representable magnitude) and underflow (when a nonzero result is smaller than the smallest normal magnitude). Special values like "Inf" and "NaN" represent overflow and invalid results, while subnormal (denormalized) numbers allow gradual underflow toward zero.
An algorithm is considered numerically stable if the small errors introduced during computation do not grow to distort the final result. In other words, stable algorithms yield accurate results even when small intermediate errors occur.
In Volume 2 of his book series "The Art of Computer Programming" (Seminumerical Algorithms), Donald Knuth offers a thorough discussion of the challenges of numerical computation and of methods for reducing errors, in the chapter titled "Arithmetic."
Donald Knuth provides various suggestions for enhancing the numerical stability of algorithms:
- Use correct rounding at critical points to reduce errors.
- Prefer tolerance-based comparisons over strict equality when dealing with floating-point numbers.
- Avoid premature rounding during calculations to prevent error accumulation.
- Avoid Catastrophic Cancellation: reformulate expressions to minimize the risk of subtracting nearly equal numbers. For instance, when \(a \approx b\), expressing \(a - b\) as \(\frac{a^2 - b^2}{a + b}\) can aid in maintaining precision.
- Exploiting Mathematical Properties: Knuth emphasizes the importance of using simple mathematical laws to enhance the stability of floating-point operations by exploiting properties like symmetry and regularity. One notable method is rounding to even, which helps minimize bias and error accumulation in both binary and decimal systems.
- Empirical Testing: Knuth emphasizes the need for empirical testing and validation, which includes stress testing algorithms with a variety of inputs to identify weaknesses and comparing results from different methods to ensure accuracy and stability.
- Maximize precision during intermediate steps to minimize rounding issues. For example, the Kahan Summation Algorithm helps minimize the rounding errors that occur when adding a series of numbers: the loop iterates through each element of the input list, compensating for any loss of precision along the way.
def kahan_sum(ls):
    somma = 0.0
    c = 0.0  # the compensation for lost low-order bits
    n = len(ls)
    for i in range(n):
        y = ls[i] - c        # apply the compensation to the next term
        t = somma + y        # low-order bits of y may be lost here
        c = (t - somma) - y  # recover the lost part (with opposite sign)
        somma = t
    return somma
Other techniques such as unnormalized arithmetic and interval arithmetic can help manage errors, but they must be applied with care.
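Two of the suggestions above can be observed directly in Python, which implements round-half-to-even in the built-in `round()` and offers `math.isclose` for tolerance-based comparisons (a small sketch):

```python
import math

# round-half-to-even ("banker's rounding") minimizes bias: halves go to
# the nearest even integer rather than always rounding up
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2

# prefer tolerance-based comparison over strict equality
assert 0.1 + 0.2 != 0.3              # strict equality fails
assert math.isclose(0.1 + 0.2, 0.3)  # tolerance-based comparison succeeds
```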
Practice: Cyber Attack Simulation
The goal of this simulation is to illustrate the distribution and statistical behavior of an attack, where multiple hackers, each with an equal probability, have the potential to compromise a server. The simulation models this scenario across n servers, m hackers, and a probability p that a hacker successfully penetrates a server.
The simulation generates two distinct graphs:
- First graph: Each line represents an individual hacker. The y-axis displays the number of servers violated, while the x-axis represents time. Each time a hacker successfully compromises a server, that hacker's line steps up by 1 (a successful violation); otherwise it stays flat (no violation).
- Second graph: A histogram depicting the distribution of violations at the conclusion of the simulation. It shows how many hackers ended up with a final breach count in each range of servers, giving an overview of the final distribution of server compromises.
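The process behind both graphs can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are not from the online implementation): each hacker attempts every server independently with probability p, so a hacker's final count is a draw from a Binomial(n, p) distribution, and each row below is the cumulative trajectory plotted in the first graph.

```python
import random

def simulate(n_servers, n_hackers, p):
    # one row per hacker: cumulative number of servers breached over time
    runs = []
    for _ in range(n_hackers):
        total, row = 0, []
        for _ in range(n_servers):
            if random.random() < p:  # successful breach with probability p
                total += 1
            row.append(total)
        runs.append(row)
    return runs

runs = simulate(n_servers=20, n_hackers=5, p=0.5)
final_counts = [row[-1] for row in runs]  # the data behind the histogram
```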
Online Implementation
Main Function
function generateGraphs() {
    const n_servers = parseInt(document.getElementById('n').value);
    const n_hackers = parseInt(document.getElementById('m').value);
    const p = parseFloat(document.getElementById('p').value);

    // Validate input
    if (n_servers < 2 || n_servers > 500 || n_hackers < 1 || n_hackers > 500 || p < 0 || p > 1) {
        alert('Please enter values in these intervals:\n- 2-500 for servers\n- 1-500 for hackers\n- 0-1 for probability.');
        return;
    }

    const servers = [];
    for (let i = 1; i <= n_servers; i++) { servers.push(i); }
    const hackers = [];
    for (let i = 1; i <= n_hackers; i++) { hackers.push(i); }

    const breachDatasets = getDataset(hackers.length, servers.length, p);

    // Wider histogram bins for larger server counts;
    // test the largest threshold first so every branch is reachable
    const serversCount = servers.length;
    let intervalSize = 1;
    if (serversCount >= 400) {
        intervalSize = 8;
    } else if (serversCount >= 300) {
        intervalSize = 7;
    } else if (serversCount >= 200) {
        intervalSize = 6;
    } else if (serversCount >= 100) {
        intervalSize = 5;
    } else if (serversCount >= 50) {
        intervalSize = 4;
    } else if (serversCount >= 20) {
        intervalSize = 3;
    } else if (serversCount >= 10) {
        intervalSize = 2;
    }

    const intervalCount = Math.ceil(serversCount / intervalSize);
    const intervals = Array.from({ length: intervalCount }, (_, i) => `${i * intervalSize}-${(i + 1) * intervalSize}`);

    const distributionCounts = Array(intervalCount).fill(0);
    breachDatasets.forEach(dataset => {
        const index = Math.floor(dataset.data.at(-1) / intervalSize);
        if (index >= 0 && index < intervalCount) { distributionCounts[index]++; }
    });

    // Trim empty leading and trailing bins
    while (distributionCounts[0] === 0) {
        distributionCounts.shift();
        intervals.shift();
    }
    while (distributionCounts.at(-1) === 0) {
        distributionCounts.pop();
        intervals.pop();
    }

    plotFlatLineChart(servers, hackers, breachDatasets);
    plotHistogram(intervals, distributionCounts);
    document.getElementById('charts').style.display = 'flex';
}
- Input Parsing: The function starts by retrieving the values from the HTML input fields for `n_servers` (number of servers), `n_hackers` (number of hackers), and `p` (probability of a successful breach). These values are parsed and stored in variables:
const n_servers = parseInt(document.getElementById('n').value);
const n_hackers = parseInt(document.getElementById('m').value);
const p = parseFloat(document.getElementById('p').value);
- Input Validation: It checks whether the input values are within the acceptable ranges. If `n_servers` or `n_hackers` fall outside their ranges (2 to 500 for servers, 1 to 500 for hackers), or if `p` is not between 0 and 1, the function issues an alert and exits:
if (n_servers < 2 || n_servers > 500 || n_hackers < 1 || n_hackers > 500 || p < 0 || p > 1) {
    alert('Please enter values in these intervals:\n- 2-500 for servers\n- 1-500 for hackers\n- 0-1 for probability.');
    return;
}
- Array Population: It creates two arrays, `servers` and `hackers`, each populated with sequential numbers representing server and hacker IDs:
const servers = [];
for (let i = 1; i <= n_servers; i++) { servers.push(i); }
const hackers = [];
for (let i = 1; i <= n_hackers; i++) { hackers.push(i); }
- Dataset Generation: The function then calls `getDataset()` to generate a dataset of simulated breaches, passing the number of hackers, the number of servers, and the probability `p` as arguments. The resulting dataset is stored in `breachDatasets`:
const breachDatasets = getDataset(hackers.length, servers.length, p);
- Interval Size Determination: The size of the histogram bins (`intervalSize`) depends on the number of servers; larger counts use wider intervals to keep the visualization meaningful (the largest threshold is tested first so that every branch is reachable):
const serversCount = servers.length;
let intervalSize = 1;
if (serversCount >= 400) { intervalSize = 8; }
else if (serversCount >= 300) { intervalSize = 7; }
else if (serversCount >= 200) { intervalSize = 6; }
else if (serversCount >= 100) { intervalSize = 5; }
else if (serversCount >= 50) { intervalSize = 4; }
else if (serversCount >= 20) { intervalSize = 3; }
else if (serversCount >= 10) { intervalSize = 2; }
- Interval Count Calculation: The number of intervals is obtained by dividing `serversCount` by `intervalSize`, which determines how many breach ranges the histogram covers. An array `intervals` is created, with each entry holding the lower and upper bounds of one interval:
const intervalCount = Math.ceil(serversCount / intervalSize);
const intervals = Array.from({ length: intervalCount }, (_, i) => `${i * intervalSize}-${(i + 1) * intervalSize}`);
- Data Binning: The function initializes an array `distributionCounts` to store the number of hackers whose final breach count falls within each interval. It then iterates through `breachDatasets`, determines which interval each final value belongs to, and increments the count for that interval:
const distributionCounts = Array(intervalCount).fill(0);
breachDatasets.forEach(dataset => {
    const index = Math.floor(dataset.data.at(-1) / intervalSize);
    if (index >= 0 && index < intervalCount) { distributionCounts[index]++; }
});
- Empty Interval Removal: If any leading or trailing intervals in `distributionCounts` are empty (i.e., contain zero hackers), they are removed, along with their labels in `intervals`, to avoid displaying unnecessary empty bins in the histogram:
while (distributionCounts[0] === 0) {
    distributionCounts.shift();
    intervals.shift();
}
while (distributionCounts.at(-1) === 0) {
    distributionCounts.pop();
    intervals.pop();
}
- Graph Plotting: The function finally calls two plotting functions, `plotFlatLineChart()` and `plotHistogram()`, to visualize the data. It passes the `servers`, `hackers`, and `breachDatasets` arrays to the first function and the `intervals` and `distributionCounts` arrays to the second. The results are displayed in the DOM element with ID `charts`:
plotFlatLineChart(servers, hackers, breachDatasets);
plotHistogram(intervals, distributionCounts);
document.getElementById('charts').style.display = 'flex';
Dataset Generation
function getDataset(hackers, servers, p) {
    let dataset = [];
    for (let i = 1; i <= hackers; i++) {
        let hackerInfo = [];
        let top = 0;
        for (let j = 0; j < servers; j++) {
            let x = top;
            if (p > Math.random()) { x = ++top; }
            hackerInfo.push(x);
        }
        dataset.push({
            data: hackerInfo,
            label: `Hacker ${i}`,
            borderColor: generateRandomColor(),
        });
    }
    return dataset;
}
- The function `getDataset` takes three parameters: `hackers`, `servers`, and `p`.
- An empty array `dataset` is initialized to hold the generated data.
- A loop runs for each hacker, creating an empty array `hackerInfo` for each one.
- The variable `top` is initialized to 0 to track "successes."
- For each hacker, another loop iterates over the servers.
- In each iteration, `p` is compared to a random value generated by `Math.random()`.
- If `p` is greater than the random value, `top` is incremented.
- Each result (`x`) is stored in the `hackerInfo` array.
- After the loop for a hacker, the data is pushed into `dataset`, with a label and a color.
- The function returns the `dataset` array containing the information for all hackers.
References
- Knuth, Donald E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, 1997.
- Mustafy, Tanvir, and Md. Tauhid Ur Rahman. Statistics and Data Analysis for Engineers and Scientists. Springer, 2024.