netFound: Foundation Model for Network Security
Satyandra Guthula, Roman Beltiukov, Navya Battula, Wenbo Guo, Arpit Gupta, Inder Monga
TL;DR
netFound addresses the generalization gap in ML for network security by pretraining a domain-specific transformer on unlabeled network telemetry and fine-tuning it for diverse tasks. It introduces four design pillars (protocol-aware tokenization, multi-modal embeddings, hierarchical transformers, and data-driven token composition) to capture hidden networking context across packets, bursts, and flows, and it handles long, heavy-tailed sequences of up to $1296$ tokens per flow. Across five downstream tasks, netFound consistently outperforms four state-of-the-art baselines and is resilient to learning shortcuts and noisy labels, with ablations confirming the value of each design choice. The work demonstrates practical potential for robust, generalizable ML in production networks and provides open-source resources (code and pretrained models) to advance research and deployment.
Abstract
Generalizable ML-based solutions for disparate learning problems in network security are highly desirable. However, despite a rich history of applying ML to network security, most existing solutions lack generalizability. This lack of progress can be attributed to an overreliance on supervised learning techniques and the associated challenges of curating well-specified labeled training data. This paper addresses this fundamental gap by introducing a novel transformer-based network foundation model, netFound. We employ self-supervised learning techniques on abundant, unlabeled network telemetry data for pretraining. The pretrained model can subsequently be fine-tuned to create generalizable learning artifacts for disparate learning tasks, even when using commonly available but challenging labeled datasets that are sparse, noisy, and skewed. To realize this goal, netFound leverages various domain-specific attributes and constraints unique to network data (packet traces) by developing multi-modal embeddings, protocol-aware tokenization, data-driven token composition, and hierarchical transformers. Our results demonstrate that netFound's domain-specific design choices ensure that it (1) effectively captures the hidden networking context in production settings, (2) outperforms four different SOTA methods on five different learning tasks, and (3) is robust to both noisy labels and learning shortcuts, properties that are critical for developing generalizable ML models in practical settings.
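To make the packet, burst, and flow hierarchy more concrete, the following is a minimal Python sketch of how a packet trace might be grouped before tokenization. It is an illustration under stated assumptions, not netFound's actual preprocessing: the `Packet` type, the direction-agnostic 5-tuple flow key, the idle-gap burst rule, and the three caps are all hypothetical, and the caps were chosen only so that their product equals the 1296-token per-flow budget mentioned above.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record (not netFound's own type)."""
    ts: float    # capture timestamp in seconds
    src: str     # source IP address
    dst: str     # destination IP address
    sport: int   # source port
    dport: int   # destination port
    proto: int   # IP protocol number, e.g. 6 for TCP


# Hypothetical caps: 12 * 6 * 18 = 1296 tokens per flow.
MAX_BURSTS_PER_FLOW = 12
MAX_PACKETS_PER_BURST = 6
TOKENS_PER_PACKET = 18
BURST_GAP_SECONDS = 1.0  # assumed idle gap that starts a new burst


def flow_key(p: Packet):
    """Direction-agnostic 5-tuple, so both directions share one flow."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (min(a, b), max(a, b), p.proto)


def group_flows(packets):
    """Group packets into flows, each ordered by timestamp."""
    flows = defaultdict(list)
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        flows[flow_key(p)].append(p)
    return dict(flows)


def split_bursts(flow_packets):
    """Split one flow into bursts at idle gaps, then apply the caps."""
    bursts = []
    for p in flow_packets:
        if not bursts or p.ts - bursts[-1][-1].ts > BURST_GAP_SECONDS:
            bursts.append([p])       # idle gap exceeded: open a new burst
        else:
            bursts[-1].append(p)     # still inside the current burst
    return [b[:MAX_PACKETS_PER_BURST] for b in bursts[:MAX_BURSTS_PER_FLOW]]


def token_count(bursts):
    """Number of tokens the flow would occupy at a fixed per-packet size."""
    return sum(len(b) for b in bursts) * TOKENS_PER_PACKET
```

Under these assumptions, a flow can never exceed `MAX_BURSTS_PER_FLOW * MAX_PACKETS_PER_BURST * TOKENS_PER_PACKET` tokens, which is one simple way to bound the input length of a hierarchical model when flow sizes are heavy-tailed.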
