Practical and Configurable Network Traffic Classification Using Probabilistic Machine Learning

Jiahui Chen, Joe Breen, Jeff M. Phillips, Jacobus Van der Merwe

TL;DR

This work tackles the practical need for accurate, configurable network traffic classification by leveraging statistics from subflows and probabilistic likelihoods. It introduces a subflow-based traffic representation with subflow lengths of $N \in \{25, 100, 1000\}$ packets and an 8-feature vector per subflow, then builds flow-level likelihoods by combining the outputs of a gradient-boosted decision tree subflow classifier into joint likelihoods that yield a final decision. Three classification modes are explored: Strict Certainty, Majority Likelihood, and Incremental Classification, each with distinct tradeoffs between certainty, speed, and coverage. The method achieves near-perfect accuracy on a Science DMZ dataset and strong performance on a broader General dataset, with decisions often reached after observing only a small fraction of a flow's subflows. The approach is validated on real-world data with detailed comparisons across datasets, subflow lengths, and percentages of subflows observed, and the code and data are made available. Overall, the framework offers configurable certainty, fast incremental decisions, and robust performance for distinguishing approved traffic from unknown activity in real networks.

Abstract

Network traffic classification that is widely applicable and highly accurate is valuable for many network security and management tasks. A flexible and easily configurable classification framework is ideal, as it can be customized for use in a wide variety of networks. In this paper, we propose a highly configurable and flexible machine learning traffic classification method that relies only on statistics of sequences of packets to distinguish known, or approved, traffic from unknown traffic. Our method is based on likelihood estimation, provides a measure of certainty for classification decisions, and can classify traffic at adjustable certainty levels. Our classification method can also be applied in different classification scenarios, each prioritizing a different classification goal. We demonstrate how our classification scheme and all its configurations perform well on real-world traffic from a high performance computing network environment.
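To make the likelihood-based scheme summarized above more concrete, here is a minimal Python sketch of how per-subflow probabilities might be combined into a flow-level decision. It is an illustration under stated assumptions, not the authors' implementation: it assumes subflows are treated as independent so their likelihoods multiply, it assumes a uniform prior over the two classes, and the function names, the `certainty` parameter, and the readings of the three modes are hypothetical interpretations of the mode names rather than definitions taken from the paper.

```python
import math
from typing import List, Optional, Tuple


def posterior_known(subflow_probs: List[float]) -> float:
    """Normalized joint likelihood that a flow is 'known'.

    subflow_probs holds P(known) for each subflow, e.g. as produced by some
    subflow classifier. ASSUMPTION: subflows are independent, so the joint
    likelihood is the product of per-subflow values (summed in log space).
    """
    eps = 1e-12  # guard against log(0)
    log_known = sum(math.log(max(p, eps)) for p in subflow_probs)
    log_unknown = sum(math.log(max(1.0 - p, eps)) for p in subflow_probs)
    # Normalize the two joint likelihoods in a numerically stable way.
    m = max(log_known, log_unknown)
    e_known = math.exp(log_known - m)
    e_unknown = math.exp(log_unknown - m)
    return e_known / (e_known + e_unknown)


def classify_majority(subflow_probs: List[float]) -> str:
    """Hypothetical 'majority likelihood' mode: pick the likelier class."""
    return "known" if posterior_known(subflow_probs) >= 0.5 else "unknown"


def classify_strict(subflow_probs: List[float],
                    certainty: float = 0.99) -> Optional[str]:
    """Hypothetical 'strict certainty' mode: decide only above a threshold.

    Returns None when neither class reaches the requested certainty.
    """
    p = posterior_known(subflow_probs)
    if p >= certainty:
        return "known"
    if 1.0 - p >= certainty:
        return "unknown"
    return None


def classify_incremental(subflow_probs: List[float],
                         certainty: float = 0.99) -> Tuple[Optional[str], int]:
    """Hypothetical 'incremental' mode: stop at the first confident prefix.

    Returns the label (or None) and the number of subflows consumed.
    """
    for seen in range(1, len(subflow_probs) + 1):
        label = classify_strict(subflow_probs[:seen], certainty)
        if label is not None:
            return label, seen
    return None, len(subflow_probs)
```

For example, with illustrative per-subflow probabilities `[0.9, 0.8, 0.95]` and `certainty=0.99`, the incremental variant stays undecided after the first two subflows and labels the flow "known" after the third. Raising the threshold trades coverage and speed for certainty, which is the kind of tradeoff the three modes are described as exposing.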

Paper Structure

This paper contains 29 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Feature Value CDFs for 100-Packet Subflows
  • Figure 2: Machine Learning Approach and Applications (with corresponding paper sections)
  • Figure 3: Data Collection Point in the University of Utah Science DMZ Sub-network
  • Figure 4: Science DMZ Dataset: Incremental Classification
  • Figure 5: General Dataset: Incremental Classification