What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic

Jeremy Kepner; Hayden Jananthan; Michael Jones; William Arcand; David Bestor; William Bergeron; Daniel Burrill; Aydin Buluc; Chansup Byun; Timothy Davis; Vijay Gadepally; Daniel Grant; Michael Houle; Matthew Hubbell; Piotr Luszczek; Lauren Milechin; Chasen Milner; Guillermo Morales; Andrew Morris; Julie Mullen; Ritesh Patel; Alex Pentland; Sandeep Pisharody; Andrew Prout; Albert Reuther; Antonio Rosa; Gabriel Wachman; Charles Yee; Peter Michaleas

TL;DR

The paper addresses how to define and predict what is normal in Internet traffic using privacy-preserving, anonymized observations at scale. It represents traffic as anonymized hypersparse traffic matrices ${\bf A}_t$ and analyzes fundamental quantities via GraphBLAS-enabled matrix operations, enabling trillions of events to be processed. It characterizes statistical properties across large datasets with heavy-tailed Zipf-Mandelbrot distributions and temporal patterns described by a modified Cauchy family for self- and cross-correlations, then unifies these insights into an empirical, low-parameter probability model for cross-observer visibility that depends on window size $N_V$, observed degree $d$, and time lag $t$, with site-specific but stable parameters ${\gamma}$, $\delta$, $\lambda$, $\alpha$, and $\beta$. The work demonstrates practical implications for sensor placement, anomaly detection, and reproducible observational science in cyberspace, and calls for expanded observatories and methodological growth to sustain privacy-aware, large-scale network science.
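The two model families named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation. It assumes the commonly used forms $p(d) \propto 1/(d+\delta)^{\lambda}$ for the modified Zipf-Mandelbrot distribution and $\beta/(\beta + |t|^{\alpha})$ for the modified Cauchy curve. The second form is consistent with the half-maximum lag $t_{\rm half} = \beta^{1/\alpha}$ quoted in the Figure 5 caption. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
def zipf_mandelbrot_pmf(d_max, lam, delta):
    """Normalized probabilities p(d) proportional to 1/(d + delta)**lam, d = 1..d_max.

    Assumed functional form; requires delta > -1 so every term is positive.
    """
    weights = [1.0 / (d + delta) ** lam for d in range(1, d_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def modified_cauchy(t, alpha, beta):
    """Assumed correlation model at time lag t: beta / (beta + |t|**alpha).

    Equals 1 at t = 0 and decays as |t| grows.
    """
    return beta / (beta + abs(t) ** alpha)


def half_max_lag(alpha, beta):
    """Lag at which the modified Cauchy curve falls to half its peak value."""
    return beta ** (1.0 / alpha)


# Illustrative parameter values only; these are not fitted to any data set.
pmf = zipf_mandelbrot_pmf(d_max=1000, lam=1.5, delta=0.5)
t_half = half_max_lag(alpha=1.2, beta=30.0)
print(sum(pmf), modified_cauchy(t_half, alpha=1.2, beta=30.0))
```

Setting $\beta/(\beta + t^{\alpha}) = 1/2$ gives $t^{\alpha} = \beta$, which is why `half_max_lag` returns $\beta^{1/\alpha}$.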

Abstract

Understanding what is normal is a key aspect of protecting a domain. Other domains invest heavily in observational science to develop models of normal behavior to better detect anomalies. Recent advances in high-performance graph libraries, such as the GraphBLAS, coupled with supercomputers, enable processing of the trillions of observations required. We leverage this approach to synthesize low-parameter observational models of anonymized Internet traffic with a high regard for privacy.

Paper Structure (9 sections, 8 equations, 6 figures, 1 table)

This paper contains 9 sections, 8 equations, 6 figures, 1 table.

Introduction
Traffic Matrices and Network Quantities
Internet Statistical Properties
Sample Window Size
Probability Distributions
Temporal Self-Correlations
Temporal Cross-Correlations
Model Synthesis
Conclusions and Future Work

Figures (6)

Figure 1: Network Traffic Messages to Anonymized Traffic Matrix. Network traffic uses numbers to denote the source and destination addresses of messages. Network traffic messages can be aggregated and summarized into traffic matrices for analysis. These traffic matrices, when coupled with data sharing agreements, can be anonymized by relabeling source addresses (e.g., 4.4.4.4$\rightarrow$1.1.1.1) and destination addresses (e.g., 8.8.8.8$\rightarrow$2.2.2.2) using various anonymization schemes.
Figure 2: Streaming Network Traffic Quantities. Internet traffic streams of $N_V$ valid packets are divided into a variety of quantities for analysis: source packets, source fan-out, unique source-destination pair packets (or links), destination fan-in, and destination packets. Figure adapted from kepner19hypersparse.
Figure 3: Scaling with Packet Window Size. Network quantities vary with packet window size $N_V$. This example is derived from 100 billion packets collected at a large enterprise gateway. (left) Unique external sources seen over time as a fraction of total packets for different window sizes, illustrating the decreasing uniqueness as window size increases from $N_V = 2^{17}$ to $2^{27}$. (middle) Data on the left divided by $N_V^{4/5}$ indicates that the number of unique sources is proportional to $N_V/ N_V^{4/5} = N_V^{1/5}$. (right) Scaling of other network quantities from the same data set (see Table II in kepner2020multi): unique sources $\approx 5{\times}N_V^{1/5}$, unique destinations $\approx 2{\times}N_V^{1/2}$, and max link packets $\approx 0.03{\times}N_V^1$. [Note: while these scaling relationships are broadly observed, the specific parameters are often site-specific but stable over time (kepner2020multi, kepner2021spatial).]
Figure 4: Power Law Distribution of Network Quantity Probabilities. (top) Probability distributions of 5 representative measured network quantities (source packets, source fan-out, link packets, destination fan-in, and destination packets) spanning different locations, dates, and packet windows from the multi-billion packet MAWI data set. Blue circles are measured data with $\pm$1-$\sigma$ error bars. Black lines are the best-fit modified Zipf--Mandelbrot models with parameters $\delta$ and $\lambda$. (bottom) Model fit parameters of the same 5 network quantities for 350 measured probability distributions for all locations, times, and sample window sizes in the MAWI data sample, illustrating the relative stability over time of model parameters at a given site. Figure adapted from kepner19hypersparse, kepner2022new.
Figure 5: Internet Source Temporal Self-Correlations. (left & middle) Self-correlations among different categories of sources (benign and malicious) in the GreyNoise honeyfarm from 2021Q2 through 2022Q1. (right) Source self-correlations among sources observed by the CAIDA darknet telescope during 2022Q1 at noon (upper curve) and midnight (lower curve). Each point represents the sources drawn from a packet window with $N_V = 2^{30}$ valid packets. Solid lines denote measured data. Dashed lines correspond to the best-fit modified Cauchy distribution. Corresponding modified Cauchy parameters and full-width-half-maximum time $t_{\rm half} = \beta^{1/\alpha}$ are shown, illustrating the significant difference between benign and malicious traffic. Figure adapted from jananthan2023mapping.
...and 1 more figure
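The pipeline described in the captions of Figures 1 and 2 (messages to anonymized traffic matrix to network quantities) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's GraphBLAS implementation: the sparse matrix is a plain Python dictionary keyed by (source, destination), and `relabel` is a hypothetical stand-in that assigns integer labels in order of first appearance. It is not one of the anonymization schemes the paper refers to and offers no real privacy protection. All function names are assumptions made for this example.

```python
from collections import Counter


def relabel(packets):
    """Toy relabeling: replace each source and destination address with an
    integer assigned in order of first appearance (sources and destinations
    are labeled independently). Hypothetical stand-in for anonymization."""
    src_ids, dst_ids = {}, {}
    relabeled = []
    for src, dst in packets:
        s = src_ids.setdefault(src, len(src_ids))
        d = dst_ids.setdefault(dst, len(dst_ids))
        relabeled.append((s, d))
    return relabeled


def traffic_matrix(packets):
    """Sparse traffic matrix: maps (source, destination) to packet count."""
    return dict(Counter(packets))


def network_quantities(A):
    """Compute the five per-window quantities named in the Figure 2 caption."""
    src_packets, dst_packets = Counter(), Counter()
    fan_out, fan_in = Counter(), Counter()
    for (s, d), n in A.items():
        src_packets[s] += n   # packets sent by each source
        dst_packets[d] += n   # packets received by each destination
        fan_out[s] += 1       # distinct destinations per source
        fan_in[d] += 1        # distinct sources per destination
    return {
        "source_packets": dict(src_packets),
        "source_fan_out": dict(fan_out),
        "link_packets": dict(A),
        "destination_fan_in": dict(fan_in),
        "destination_packets": dict(dst_packets),
    }


# One window of N_V = 4 valid packets (addresses are illustrative).
packets = [
    ("4.4.4.4", "8.8.8.8"),
    ("4.4.4.4", "8.8.8.8"),
    ("4.4.4.4", "9.9.9.9"),
    ("5.5.5.5", "8.8.8.8"),
]
A = traffic_matrix(relabel(packets))
q = network_quantities(A)
print(q)
```

Because the quantities depend only on which entries of the matrix are nonzero and on their values, they are unchanged by any consistent relabeling of addresses, which is what allows the analysis to run on anonymized data.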
