netFound: Foundation Model for Network Security

Satyandra Guthula, Roman Beltiukov, Navya Battula, Wenbo Guo, Arpit Gupta, Inder Monga

TL;DR

netFound addresses the generalization gap in ML for network security by pretraining a domain-specific transformer on unlabeled network telemetry and fine-tuning it for diverse tasks. It introduces four design pillars (protocol-aware tokenization, multi-modal embeddings, hierarchical transformers, and data-driven token composition) to capture hidden networking context across packets, bursts, and flows, and it handles long, heavy-tailed sequences of up to $1296$ tokens per flow. Across five downstream tasks, netFound consistently outperforms four strong baselines and is resilient to learning shortcuts and noisy labels; ablations confirm the value of each design choice. The work demonstrates practical potential for robust, generalizable ML in production networks and provides open-source resources (code and pretrained models) to advance research and deployment.

Abstract

Developing generalizable ML-based solutions for disparate learning problems in network security is highly desired. However, despite a rich history of applying ML to network security, most existing solutions lack generalizability. This lack of progress can be attributed to an overreliance on supervised learning techniques and the associated challenges of curating well-specified labeled training data. This paper addresses a fundamental gap by introducing a novel transformer-based network foundation model, netFound. We employ self-supervised learning techniques on abundant, unlabeled network telemetry data for pre-training. This pretrained model can subsequently be fine-tuned to create generalizable learning artifacts for disparate learning tasks, even when using commonly available but challenging labeled datasets that are sparse, noisy, and skewed. To realize this goal, netFound leverages various domain-specific attributes and constraints unique to network data (packet traces) by developing multi-modal embeddings, protocol-aware tokenization, data-driven token composition, and hierarchical transformers. Our results demonstrate that netFound's domain-specific design choices ensure that it (1) effectively captures the hidden networking context in production settings, (2) outperforms four different SOTA methods on five different learning tasks, and (3) is robust to both noisy labels and learning shortcuts -- critical for developing generalizable ML models in practical settings.

Paper Structure

This paper contains 39 sections, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Comparison between a naive hierarchical model (strawman 2) and the proposed hierarchical transformer.
  • Figure 2: Data extraction, Featurization & Protocol-aware Tokenization: Pipeline for converting packet traces into tokens with metadata. After the flows are extracted from packet traces, we collect the relevant fields into features at different granularities and then convert them into tokens.
  • Figure 3: Pre-training: the hierarchical transformer uses a subset of tokens, selected using data-driven methods, for model training. These tokens are extracted from packet fields through protocol-aware tokenization and are augmented with multi-modal embeddings. The dashed lines in the model represent the skip connections.
  • Figure 4: The token prediction performance between netFound and its different ablated variations using long sequences (L), protocol-aware tokenization (T), multi-modality (M), and hierarchy (H).
  • Figure 5: The testing performance of netFound and baselines trained on training sets with different noisy label rates ($P_n$).