What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic
Jeremy Kepner, Hayden Jananthan, Michael Jones, William Arcand, David Bestor, William Bergeron, Daniel Burrill, Aydin Buluc, Chansup Byun, Timothy Davis, Vijay Gadepally, Daniel Grant, Michael Houle, Matthew Hubbell, Piotr Luszczek, Lauren Milechin, Chasen Milner, Guillermo Morales, Andrew Morris, Julie Mullen, Ritesh Patel, Alex Pentland, Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Gabriel Wachman, Charles Yee, Peter Michaleas
TL;DR
The paper addresses how to define and predict what is normal in Internet traffic using privacy-preserving, anonymized observations at scale. It represents traffic as anonymized hypersparse traffic matrices ${\bf A}_t$ and computes fundamental network quantities with GraphBLAS-enabled matrix operations, which makes processing trillions of events feasible. Across large datasets, the statistical properties are characterized by heavy-tailed Zipf-Mandelbrot distributions, and the temporal self- and cross-correlations are described by a modified Cauchy family. These results are unified into an empirical, low-parameter probability model for cross-observer visibility that depends on window size $N_V$, observed degree $d$, and time lag $t$, with site-specific but stable parameters $\gamma$, $\delta$, $\lambda$, $\alpha$, and $\beta$. The work demonstrates practical implications for sensor placement, anomaly detection, and reproducible observational science in cyberspace, and calls for expanded observatories and methodological growth to sustain privacy-aware, large-scale network science.
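To make the traffic-matrix idea concrete, the following is a minimal sketch and not the authors' GraphBLAS pipeline. It assumes packets arrive as (anonymized source, anonymized destination) pairs and uses a plain Python `Counter` as a stand-in for a hypersparse matrix ${\bf A}_t$. The function names `traffic_matrix`, `network_quantities`, and `zipf_mandelbrot` are hypothetical, and the Zipf-Mandelbrot form $p(d) \propto 1/(d+\delta)^{\alpha}$ is the common textbook form, which may not match the paper's exact parameterization.

```python
from collections import Counter


def traffic_matrix(packets):
    """Build a sparse traffic matrix as {(src, dst): packet_count}.

    Only nonzero entries are stored, mimicking a hypersparse matrix.
    """
    return Counter(packets)


def network_quantities(A):
    """Compute a few aggregate quantities from the sparse matrix A."""
    source_packets = Counter()  # row sums: packets sent per source
    source_fan_out = Counter()  # nonzeros per row: distinct destinations per source
    for (src, _dst), count in A.items():
        source_packets[src] += count
        source_fan_out[src] += 1
    return {
        "valid_packets": sum(A.values()),
        "unique_links": len(A),
        "unique_sources": len(source_packets),
        "unique_destinations": len({dst for _src, dst in A}),
        "max_source_packets": max(source_packets.values()),
        "max_source_fan_out": max(source_fan_out.values()),
    }


def zipf_mandelbrot(d_max, alpha, delta):
    """Normalized textbook Zipf-Mandelbrot probabilities for d = 1..d_max.

    Assumed form: p(d) proportional to 1 / (d + delta) ** alpha.
    """
    weights = [1.0 / (d + delta) ** alpha for d in range(1, d_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical anonymized identifiers, for illustration only.
packets = [("a", "b"), ("a", "b"), ("a", "c"), ("d", "b")]
A = traffic_matrix(packets)
quantities = network_quantities(A)
probabilities = zipf_mandelbrot(d_max=1000, alpha=1.5, delta=0.5)
```

A dictionary keyed by (source, destination) stores only observed links, which is the property that makes hypersparse representations practical when the address space is vastly larger than the number of observed links.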
Abstract
Understanding what is normal is a key aspect of protecting a domain. Other domains invest heavily in observational science to develop models of normal behavior to better detect anomalies. Recent advances in high performance graph libraries, such as the GraphBLAS, coupled with supercomputers enable processing of the trillions of observations required. We leverage this approach to synthesize low-parameter observational models of anonymized Internet traffic with a high regard for privacy.
