Practical and Configurable Network Traffic Classification Using Probabilistic Machine Learning
Jiahui Chen, Joe Breen, Jeff M. Phillips, Jacobus Van der Merwe
TL;DR
This work addresses the practical need for accurate, configurable network traffic classification using subflow statistics and probabilistic likelihoods. It represents each flow as a sequence of subflows with subflow length $N \in \{25, 100, 1000\}$, each summarized by an 8-feature vector. A gradient-boosted decision tree classifies each subflow, and the per-subflow likelihoods are combined into a joint flow-level likelihood that yields the final decision. Three classification modes are explored: Strict Certainty, Majority Likelihood, and Incremental Classification, each with distinct tradeoffs between certainty, speed, and coverage. The method achieves near-perfect accuracy on a Science DMZ dataset and strong performance on a broader General dataset, with decisions often reached after observing only a small fraction of a flow's subflows. The approach is validated on real-world data, with detailed comparisons across datasets, subflow lengths, and percentages of subflows observed, and the code and data are available. Overall, the framework offers configurable certainty, fast incremental decisions, and robust performance for distinguishing approved traffic from unknown activity in real networks.
Abstract
Network traffic classification that is widely applicable and highly accurate is valuable for many network security and management tasks. A flexible and easily configurable classification framework is ideal, as it can be customized for use in a wide variety of networks. In this paper, we propose a highly configurable and flexible machine learning traffic classification method that relies only on statistics of sequences of packets to distinguish known, or approved, traffic from unknown traffic. Our method is based on likelihood estimation, provides a measure of certainty for classification decisions, and can classify traffic at adjustable certainty levels. Our classification method can also be applied in different classification scenarios, each prioritizing a different classification goal. We demonstrate how our classification scheme and all its configurations perform well on real-world traffic from a high performance computing network environment.
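As a rough illustration of how likelihood-based classification at an adjustable certainty level might work, the following minimal Python sketch combines per-subflow probabilities into a flow-level decision. It is not the paper's implementation: the independence assumption across subflows, the function names (`flow_posterior`, `classify_flow`, `classify_incrementally`), the `certainty` parameter, and the "undecided" outcome are all hypothetical choices made for this example. The per-subflow probabilities are assumed to come from some subflow classifier, such as a gradient-boosted decision tree.

```python
import math


def flow_posterior(subflow_probs, eps=1e-9):
    """Combine per-subflow P(known) values into one flow-level P(known).

    Assumption (not from the paper): subflows are treated as independent,
    so the joint likelihood is the product of per-subflow likelihoods.
    """
    log_known = 0.0
    log_unknown = 0.0
    for p in subflow_probs:
        # Clamp to avoid log(0) when a classifier outputs exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        log_known += math.log(p)
        log_unknown += math.log(1.0 - p)
    # Normalize in log space for numerical stability.
    m = max(log_known, log_unknown)
    z_known = math.exp(log_known - m)
    z_unknown = math.exp(log_unknown - m)
    return z_known / (z_known + z_unknown)


def classify_flow(subflow_probs, certainty=0.99):
    """Return 'known', 'unknown', or 'undecided' at the given certainty level."""
    p_known = flow_posterior(subflow_probs)
    if p_known >= certainty:
        return "known"
    if 1.0 - p_known >= certainty:
        return "unknown"
    return "undecided"


def classify_incrementally(subflow_probs, certainty=0.99):
    """Stop as soon as the observed subflows reach the certainty level.

    Returns the label and the number of subflows observed before deciding.
    """
    seen = []
    for p in subflow_probs:
        seen.append(p)
        label = classify_flow(seen, certainty)
        if label != "undecided":
            return label, len(seen)
    return "undecided", len(seen)
```

Raising `certainty` in this sketch makes decisions more conservative, so more flows stay undecided or need more subflows before a label is assigned. Lowering it trades certainty for earlier decisions.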
