
Statistics and Probability for Engineering Benchmarks

The Statistics You Learned in School but Never Applied

Bridge the gap between academic statistics and real-world engineering.

July 14, 2025
17 min read
by Mohamed Habib Jaouadi
#performance
#statistics
#benchmarking
#engineering
#systems

As engineers, we measure everything: database query performance, API response times, network latency, memory usage, CPU utilization, and more.

We run these benchmarks, gather numbers, and then make changes that affect millions of end users and thousands of servers. But the uncomfortable truth is that most of us are operating in a fog when it comes to interpreting those numbers properly.

You've likely seen this scenario before: two engineers benchmark the same system and get different results. One claims the new caching layer improved response times by 15%. The other disagrees and says it made performance worse. Both have numbers, both have confidence, and both could be wrong.

What is missing? A good understanding of probability and statistics as they relate to real-world engineering systems.

Why Probabilistic Thinking Matters in System Benchmarks

In the real world, systems are noisy and non-deterministic. Your web server doesn't respond in exactly the 150ms you expect every time; CPU load, network traffic, and background processes all introduce small performance variations. Real systems never behave like deterministic functions that return the same measurement for the same input.

Instead, these systems exhibit:

  • Natural variability: Background processes, garbage collection, thermal throttling
  • Environmental noise: Network congestion, disk I/O contention, CPU scheduling
  • Measurement uncertainty: Timer resolution, system call overhead, instrumentation impact

The bottom line: without statistical tools to handle this variability, we make bad engineering decisions based on poor or misinterpreted data.

Here's a real example: An engineer benchmarks two API endpoints and gets these results:

Endpoint A: 145ms, 152ms, 148ms, 151ms, 149ms
Endpoint B: 147ms, 153ms, 146ms, 150ms, 154ms

Quick glance: Endpoint A looks faster (average 149ms vs 150ms). But is this difference meaningful, or just noise? Without proper statistical analysis, you can't tell.

This is where probability and statistics become essential engineering tools, not academic luxuries.

Core Statistical Concepts Every Engineer Should Know

Descriptive Statistics: Your First Line of Defense

When you collect benchmark data, descriptive statistics help you understand what you're actually looking at.

Mean (Average): The sum divided by count. Useful but dangerous when used alone.

Mean calculation example
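
A minimal sketch using Python's built-in statistics module; the sample values are hypothetical:

```python
import statistics

# Five hypothetical response-time samples, in milliseconds
response_times = [145, 152, 148, 151, 149]

mean = statistics.mean(response_times)
print(f"Mean: {mean:.1f} ms")  # 149.0 ms
```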

Median: The middle value when sorted. More robust against outliers.

Median calculation example
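
A minimal sketch showing why the median is more robust: a single outlier drags the mean far away while the median barely moves. Values are hypothetical:

```python
import statistics

# Four normal samples plus one extreme outlier (e.g., a GC pause), in ms
response_times = [145, 152, 148, 151, 3000]

print(f"Mean:   {statistics.mean(response_times):.1f} ms")    # 719.2 ms, dragged by the outlier
print(f"Median: {statistics.median(response_times):.1f} ms")  # 151.0 ms, robust
```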

Standard Deviation: Measures how spread out your data is. Low std dev means consistent performance; high std dev indicates variability.

Standard deviation calculation
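
A minimal sketch contrasting a consistent system with a variable one (hypothetical values):

```python
import statistics

consistent = [148, 150, 149, 151, 150]  # tight spread around 150 ms
variable = [100, 200, 150, 50, 250]     # same rough center, wild spread

print(f"Consistent: stdev = {statistics.stdev(consistent):.1f} ms")  # ~1.1 ms
print(f"Variable:   stdev = {statistics.stdev(variable):.1f} ms")    # ~79.1 ms
```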

Percentiles (P50, P95, P99): The values below which a certain percentage of observations fall.

Percentile calculations
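
A sketch using NumPy; the latencies are synthetic, drawn from a lognormal distribution to mimic a mostly fast service with a slow tail:

```python
import numpy as np

# 1,000 synthetic latency samples: mostly fast, with a long slow tail
rng = np.random.default_rng(42)
latencies = rng.lognormal(mean=5.0, sigma=0.3, size=1000)  # ms, median ~148 ms

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"P50: {p50:.0f} ms   P95: {p95:.0f} ms   P99: {p99:.0f} ms")
```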

Why percentiles matter: In production systems, you care more about "What's the worst experience 5% of users will have?" (P95) than "What's the average experience?" Users don't experience averages.

Confidence Intervals: Expressing Uncertainty Like an Engineer

A confidence interval provides a range of plausible values for your measurement, considering uncertainty. Instead of saying "Response time is 150ms," you say "Response time is 150ms ± 15ms (95% confidence)."

An intuitive way to think about it is this: if you repeated your experiment many times, and each time calculated a 95% confidence interval, then about 95% of those intervals would contain the true (but unknown) mean response time.

Confidence Interval Calculation
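
A sketch of a t-distribution-based interval using SciPy; confidence_interval is a helper written for this post, not a library function:

```python
import numpy as np
from scipy import stats

def confidence_interval(samples, confidence=0.95):
    """Return (mean, half_width) of a t-based confidence interval for the mean."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    sem = stats.sem(samples)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(samples) - 1)
    return mean, half_width

times = [145, 152, 148, 151, 149]
m, h = confidence_interval(times)
print(f"Response time: {m:.1f} ms ± {h:.1f} ms (95% confidence)")
```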

Practical usage: When comparing two systems, overlapping confidence intervals suggest the difference might not be meaningful. Non-overlapping intervals indicate a likely real difference.

Statistical Laws That Save You From Bad Decisions

Law of Large Numbers: Why Sample Size Matters

The Law of Large Numbers states that as you collect more samples, your measured average gets closer to the true average of the system.

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu$$

Where $X_i$ are independent, identically distributed random variables with expected value $\mu$.

Engineering implication: Running your benchmark 3 times isn't enough. Neither is 10. You need enough samples for the noise to average out.

Law of Large Numbers Simulation
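
A simulation sketch: we draw synthetic samples around a "true" mean of 150 ms and watch the running average converge as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 150  # ms: the "true" latency of our hypothetical system
samples = rng.normal(loc=true_mean, scale=20, size=10_000)

# The running mean converges toward the true mean as n grows
for n in [3, 10, 30, 100, 1000, 10_000]:
    print(f"n={n:>6}: running mean = {samples[:n].mean():.2f} ms")
```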

Rule of thumb: For stable systems, collecting 30 to 50 samples is usually enough to get a reliable average. If your system is noisy or you're trying to detect small performance differences, you may need hundreds of samples to get meaningful results.

Interactive Demonstration: Law of Large Numbers

[Interactive chart: Law of Large Numbers in System Benchmarking. Shows how increasing sample size makes benchmark measurements more reliable and closer to the true system performance, using a normal distribution (a well-behaved system with consistent performance).]

Central Limit Theorem: Why Averages Work

The Central Limit Theorem explains why averaging makes sense, even when your underlying data isn't normally distributed.

$$\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)$$

Where $\overline{X}_n$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the standard deviation, and $N(0,1)$ is the standard normal distribution.

The theorem: When you take many samples and compute their average, those averages will be normally distributed around the true mean, regardless of the original distribution's shape.

Engineering implication: This justifies using confidence intervals and statistical tests based on normal distributions, even when individual response times follow other patterns.

Central Limit Theorem Demonstration
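
A demonstration sketch: individual latencies are drawn from a heavily skewed exponential distribution, yet the means of size-50 samples come out approximately normal, with spread close to $\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Heavily skewed individual latencies (exponential, mean 150 ms): not normal at all
population = rng.exponential(scale=150, size=100_000)

# Means of 2,000 samples of size 50 are approximately normally distributed
sample_means = population.reshape(2000, 50).mean(axis=1)

print(f"Population mean: {population.mean():.1f} ms (distribution: skewed)")
print(f"Sample means:    mean = {sample_means.mean():.1f} ms, "
      f"stdev = {sample_means.std():.1f} ms (theory: sigma/sqrt(n) = {150 / np.sqrt(50):.1f})")
```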

Why you never connected this to engineering work: most engineering programs teach statistics as a set of abstract mathematical ideas (hypothesis testing, normally distributed models, confidence intervals, and so on), but rarely show how those concepts apply to real-world systems. Maybe you learned t-tests in your probability class, but nobody told you how to apply them to query performance comparisons. You memorized the Central Limit Theorem for your exams, but no one explained how it validates your benchmarking approach.

Common Benchmarking Pitfalls and How to Avoid Them

Pitfall 1: The Single Run Trap

Wrong way - Single measurement
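
A reconstructed sketch of what this anti-pattern typically looks like (the URL is a placeholder):

```bash
# One timing of one request: a single point from a noisy, time-varying system
time curl -s https://api.example.com/endpoint > /dev/null
```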

The problem: One measurement tells you almost nothing about system performance. You've captured a single point in a noisy, time-varying system.

The fix: Always run multiple iterations and report distributions.

Proper Endpoint Benchmarking
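
One reasonable approach, sketched with the requests library; the function name, warm-up count, and iteration count are illustrative choices, not canonical values:

```python
import time
import statistics
import requests  # assumed dependency: pip install requests

def benchmark_endpoint(url, iterations=100, warmup=10):
    """Measure an endpoint repeatedly and report a distribution, not a point."""
    for _ in range(warmup):  # warm connections, caches, and JIT-compiled paths
        requests.get(url)

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        requests.get(url)
        samples.append((time.perf_counter() - start) * 1000)  # ms

    samples.sort()
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
        "p95": samples[int(0.95 * len(samples))],
    }

print(benchmark_endpoint("http://localhost:8000/api"))  # placeholder URL
```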

Pitfall 2: The Mean-Only Mindset

The problem: Reporting only averages hides crucial information about system behavior.

Consider these two systems:

  • System A: Response times consistently 100ms ± 5ms
  • System B: Response times average 100ms, but range from 50ms to 500ms

Same average, completely different user experience.

The fix: Always report percentiles alongside means.

Statistics Functions
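
A sketch of a summary helper built on NumPy; summarize is a name chosen for this post, and the two synthetic systems are constructed to share a mean while differing wildly in spread:

```python
import numpy as np

def summarize(samples):
    """Report the whole distribution: a mean alone hides the shape."""
    a = np.asarray(samples, dtype=float)
    return {
        "mean": a.mean(),
        "stdev": a.std(ddof=1),
        "p50": np.percentile(a, 50),
        "p95": np.percentile(a, 95),
        "p99": np.percentile(a, 99),
        "max": a.max(),
    }

rng = np.random.default_rng(2)
system_a = rng.normal(100, 5, 1000)    # consistent: ~100 ms ± 5 ms
system_b = rng.uniform(50, 150, 1000)  # same mean (~100 ms), wide spread

print("A:", {k: round(v, 1) for k, v in summarize(system_a).items()})
print("B:", {k: round(v, 1) for k, v in summarize(system_b).items()})
```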

Pitfall 3: Flaky Test Syndrome

The problem: Your benchmark results vary wildly between runs, making comparisons impossible.

Common causes:

  • Insufficient warm-up period
  • Background processes interfering
  • Inconsistent load conditions
  • Measurement overhead

The fix: Control your environment and establish baseline stability.

Stable Benchmarking Functions
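
One possible stabilization strategy, sketched: keep sampling until the coefficient of variation (stdev divided by mean) falls below a threshold. The function name and thresholds are illustrative:

```python
import time
import statistics

def wait_for_stability(measure, window=10, max_cv=0.05, max_attempts=10):
    """Sample until the coefficient of variation drops below max_cv.

    `measure` is any zero-argument callable returning a latency in ms.
    """
    for _ in range(max_attempts):
        samples = [measure() for _ in range(window)]
        cv = statistics.stdev(samples) / statistics.mean(samples)
        if cv <= max_cv:
            return samples  # environment looks stable enough to benchmark
        time.sleep(1)  # let background noise settle, then try again
    raise RuntimeError(f"System never stabilized (last CV: {cv:.1%})")
```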

Practical Guidelines for Robust Benchmarks

1. Planning Your Benchmark

Before writing any code, ask yourself:

  • What exactly am I measuring? (Latency? Throughput? Resource usage?)
  • What factors might affect the results? (CPU load, memory pressure, network conditions)
  • How precise do I need to be? (Is a 5% difference meaningful for this system?)
  • What's my baseline? (Current system performance under identical conditions)

2. Environment Control

Environment Checking for Benchmarks
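
A sketch assuming the psutil library; the thresholds are arbitrary starting points that you should tune for your own hardware:

```python
import psutil  # assumed dependency: pip install psutil

def check_environment(max_cpu=20.0, min_free_mem=0.25):
    """Warn if the machine already looks too busy for a clean benchmark."""
    warnings = []

    cpu = psutil.cpu_percent(interval=1.0)  # sample CPU load for one second
    if cpu > max_cpu:
        warnings.append(f"CPU already at {cpu:.0f}% before the benchmark")

    mem = psutil.virtual_memory()
    if mem.available / mem.total < min_free_mem:
        warnings.append(f"Only {mem.available / mem.total:.0%} of memory is free")

    return warnings

for w in check_environment():
    print("WARNING:", w)
```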

3. Comparing Two Systems Rigorously

When you need to determine if System B is actually better than System A:

Statistical System Comparison
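
A sketch using Welch's t-test from SciPy (it does not assume equal variances); compare_systems and the alpha threshold are illustrative:

```python
import numpy as np
from scipy import stats

def compare_systems(samples_a, samples_b, alpha=0.05):
    """Welch's t-test: is the difference in means likely real, or just noise?"""
    t_stat, p_value = stats.ttest_ind(samples_a, samples_b, equal_var=False)
    return {
        "mean_diff_ms": float(np.mean(samples_b) - np.mean(samples_a)),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "significant": p_value < alpha,
    }

rng = np.random.default_rng(3)
system_a = rng.normal(150, 10, 100)  # current system
system_b = rng.normal(145, 10, 100)  # candidate system, ~5 ms faster
print(compare_systems(system_a, system_b))
```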

Real-World Examples

Example 1: Database Query Performance

You're optimizing a database query and want to measure the impact of adding an index.

Database Query Benchmarking
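
A sketch using sqlite3 as a stand-in for a real database driver; the app.db file, users table, and index are hypothetical:

```python
import time
import statistics
import sqlite3  # stand-in for your real database driver

def benchmark_query(conn, sql, params=(), iterations=50, warmup=5):
    """Time a query repeatedly; return latency samples in ms."""
    cur = conn.cursor()
    for _ in range(warmup):  # warm the page cache and query plan
        cur.execute(sql, params).fetchall()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        cur.execute(sql, params).fetchall()
        samples.append((time.perf_counter() - start) * 1000)
    return samples

conn = sqlite3.connect("app.db")  # hypothetical database
query = "SELECT * FROM users WHERE email = ?"

before = benchmark_query(conn, query, ("alice@example.com",))
conn.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)")
after = benchmark_query(conn, query, ("alice@example.com",))

print(f"Before index: median {statistics.median(before):.2f} ms")
print(f"After index:  median {statistics.median(after):.2f} ms")
```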

Example 2: HTTP API Latency Analysis

You're comparing two API implementations to decide which to deploy.

API Benchmarking with Concurrency
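
A sketch that uses a thread pool plus the requests library to measure latency under concurrent load; the localhost URLs for implementations A and B are placeholders:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests  # assumed dependency

def timed_get(url):
    start = time.perf_counter()
    requests.get(url, timeout=10)
    return (time.perf_counter() - start) * 1000  # ms

def benchmark_concurrent(url, total=200, concurrency=10):
    """Measure latency with several requests in flight, closer to production."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = sorted(pool.map(timed_get, [url] * total))
    return {
        "median": statistics.median(samples),
        "p95": samples[int(0.95 * len(samples))],
    }

for name, url in [("A", "http://localhost:8000/v1"),   # placeholder URLs
                  ("B", "http://localhost:8001/v1")]:
    print(name, benchmark_concurrent(url))
```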

Example 3: Load Testing and Capacity Planning

You need to determine how many concurrent users your system can handle.

Load Testing Framework
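
A minimal step-load sketch, again assuming the requests library; dedicated load-testing tools do this far more robustly, but the structure shows the idea of stepping up concurrency until latency or the error rate degrades:

```python
import time
from concurrent.futures import ThreadPoolExecutor
import requests  # assumed dependency

def run_load_step(url, users, duration_s=10):
    """Simulate `users` concurrent clients hitting the endpoint for duration_s."""
    def worker():
        latencies, errors = [], 0
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            start = time.perf_counter()
            try:
                requests.get(url, timeout=5)
                latencies.append((time.perf_counter() - start) * 1000)
            except requests.RequestException:
                errors += 1
        return latencies, errors

    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(worker) for _ in range(users)]
        all_lat, all_err = [], 0
        for f in futures:
            lat, err = f.result()
            all_lat.extend(lat)
            all_err += err

    all_lat.sort()
    return {"users": users,
            "p95_ms": round(all_lat[int(0.95 * len(all_lat))], 1),
            "errors": all_err,
            "rps": round(len(all_lat) / duration_s, 1)}

# Step up the load and watch where P95 or the error rate starts to degrade
for users in [1, 5, 10, 25, 50]:
    print(run_load_step("http://localhost:8000/health", users))  # placeholder URL
```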

Interpreting Distributions Over Time

When monitoring production systems, you need to understand how performance metrics evolve. Here's how to track and interpret trends:

Time Series Performance Analysis
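
A sketch of rolling-percentile tracking over synthetic data with a deliberate upward drift baked in; in production you would feed real timestamped metrics instead:

```python
import numpy as np

def rolling_percentiles(timestamps, latencies, window=100):
    """Track P50/P95 over a sliding window to spot drift in production metrics."""
    ts, lat = np.asarray(timestamps), np.asarray(latencies)
    rows = []
    for i in range(window, len(lat)):
        w = lat[i - window:i]
        rows.append((ts[i], np.percentile(w, 50), np.percentile(w, 95)))
    return rows

# Synthetic example: latency slowly degrading by ~40 ms over the period
rng = np.random.default_rng(4)
n = 1000
timestamps = np.arange(n)
latencies = rng.normal(150, 15, n) + np.linspace(0, 40, n)

for t, p50, p95 in rolling_percentiles(timestamps, latencies)[::200]:
    print(f"t={t:>4}  P50={p50:.0f} ms  P95={p95:.0f} ms")
```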

Building Your Statistical Toolkit

As an engineer, you don't need to become a statistician, but having the right tools makes all the difference. Here's a practical toolkit:

Essential Python Libraries

Required Python Libraries
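
One plausible toolkit; the comments note what each library typically contributes to benchmark analysis:

```python
# Core scientific Python stack for benchmark analysis
import numpy as np               # arrays, percentiles, vectorized math
import scipy.stats as stats      # t-tests, confidence intervals, distributions
import pandas as pd              # tabular benchmark results, rolling windows
import matplotlib.pyplot as plt  # histograms and time-series plots

# Install everything in one go:
#   pip install numpy scipy pandas matplotlib
```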

Ready-to-Use Functions

Quick Benchmark Analysis Function
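
A sketch that bundles the pieces from this post into one helper; analyze_benchmark is a name invented for this article:

```python
import numpy as np
from scipy import stats

def analyze_benchmark(samples, confidence=0.95):
    """One-call summary: center, spread, tail behavior, and uncertainty."""
    a = np.asarray(samples, dtype=float)
    sem = stats.sem(a)  # standard error of the mean
    half = sem * stats.t.ppf((1 + confidence) / 2, len(a) - 1)
    return {
        "n": len(a),
        "mean": a.mean(),
        "ci": (a.mean() - half, a.mean() + half),
        "median": float(np.percentile(a, 50)),
        "p95": float(np.percentile(a, 95)),
        "p99": float(np.percentile(a, 99)),
        "stdev": a.std(ddof=1),
    }

print(analyze_benchmark([145, 152, 148, 151, 149]))
```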

Conclusion: Making Statistics Work for You

Statistics isn't a theoretical construct that lives only in the ivory tower of academia. Applied to real measurements, it survives contact with real systems and helps you make more informed decisions about the ones you build and operate.

Sample size matters enormously. The Law of Large Numbers is not just an academic result; it is the reason we need many samples before we can trust a benchmark. The larger the sample size, the more reliable the average.

Don't stop at averages. The rest of the distribution is there to see, understand, and use; in particular, the slow experiences captured by high percentiles like P95 and P99 are what matter most to users.

Lastly, maintain consistency across your testing environment. If you cannot replicate your benchmark environment, you won't learn anything.


Points to remember

  • Before you run any benchmark, ask yourself: do I have enough samples, am I looking at the right metrics, can I quantify my uncertainty, and do the differences I see actually matter? These tools won't speed up your systems by themselves, but they will tell you whether your work is making a real difference, and that kind of clarity is priceless.
