Always be Pre-Training: Representation Learning for Network Intrusion Detection with GNNs

Zhengyao Gu, Diego Troy Lopez, Lilas Alrahis, Ozgur Sinanoglu

TL;DR

This work tackles label dependency and information loss in GNN-based NIDS by introducing dense vector representations for mixed numeric and categorical features coupled with in-context self-supervised pre-training. The two-stage approach pre-trains on unlabeled network traffic from the same network and then fine-tunes with limited labeled data, enabling strong performance with minimal annotation. Empirical results on ToN-IoT and NF-UQ-NIDS-V2 show that SSL-pretrained dense representations achieve near-supervised performance in few-shot settings and substantially improve data efficiency, outperforming target-encoded baselines. The findings suggest a practical path toward scalable, annotation-efficient NIDS deployment with adaptable multi-class capabilities.

Abstract

Graph neural network-based network intrusion detection systems have recently demonstrated state-of-the-art performance on benchmark datasets. Nevertheless, these methods rely on target encoding for data pre-processing, which requires annotated labels, a cost-prohibitive requirement that limits widespread adoption. In this work, we propose combining in-context pre-training with dense representations for categorical features to jointly overcome this label dependency. Our approach is highly data-efficient, achieving over 98% of the performance of the supervised state-of-the-art with less than 4% labeled data on the NF-UQ-NIDS-V2 dataset.
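
The contrast between target encoding and dense representations can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy `protocols` feature, the `is_attack` labels, and the names `target_encode` and `DenseEmbedding` are all hypothetical, chosen only for illustration. The sketch shows that target encoding cannot be computed without labels, while a dense lookup table can be built from the categories alone and trained later.

```python
import random
from collections import defaultdict


def target_encode(categories, labels):
    """Map each category to the mean label observed for it (needs labels)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in counts}


class DenseEmbedding:
    """Label-free lookup table holding one vector per category.

    In a real model these vectors would be trainable parameters;
    here they are only randomly initialized.
    """

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        self.index = {c: i for i, c in enumerate(sorted(set(categories)))}
        self.table = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in self.index
        ]

    def __call__(self, category):
        return self.table[self.index[category]]


# Hypothetical toy data: one categorical flow feature and binary labels.
protocols = ["tcp", "udp", "tcp", "icmp", "udp", "tcp"]
is_attack = [1, 0, 1, 0, 0, 0]

encoding = target_encode(protocols, is_attack)  # requires is_attack
embed = DenseEmbedding(protocols, dim=4)        # requires no labels
```

In this toy data, `udp` and `icmp` receive the same target-encoded scalar because neither co-occurs with an attack, so the encoding can no longer tell them apart. The dense table keeps a separate vector for each category, which is one way to read the information-loss concern raised in the TL;DR.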

Paper Structure

This paper contains 34 sections, 10 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: NIDS deployment, centralizing intrusion detection by monitoring network activities [idssurvey2013]. Monitored devices can be general-purpose computer hosts, IoT devices such as IP cameras, wireless devices, or other equipment. Switches move traffic throughout a network and offer mirroring capabilities for NIDS data acquisition. NIDS monitors may be general-purpose servers running applications that sniff the mirrored traffic directly on the wire.
  • Figure 2: The proposed training pipeline with in-context pre-training, demonstrated using Anomal-E [anomale] as the SSL technique. Here $A, B, C, D$ represent four arbitrary nodes (flow endpoints), and the edges between them represent the flows. During the pre-training phase, we train an encoder $f_\theta$ and a decoder $q_\xi$. Subsequently, $f_\theta$ is connected to a classification head and trained using labeled data.
  • Figure 3: Full-data setting: F1-score of dense representation models (D) and target encoding models (T) trained on all labeled data. For each model type, we experiment with an SSL pre-trained variant (-SSL) and one without pre-training. Decision Tree (DT) is included as a baseline.
  • Figure 4: Few-shot setting: Comparing pre-trained (target encoding T-SSL, dense representation D-SSL) and directly learned models (E-GraphSAGE, Decision Tree DT) on ToN-IoT (top) and NF-UQ-NIDS-V2 (bottom) with limited data. We present the best-performing E-GraphSAGE without pre-training in the plots (either target encoding T or dense representation D).
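
The two-stage pipeline described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a simplified stand-in under loud assumptions, not the paper's method: it replaces the GNN encoder with a linear map `W` (playing the role of $f_\theta$), replaces the Anomal-E objective with plain reconstruction through a linear decoder `V` (playing the role of $q_\xi$), and uses synthetic data. The function names `pretrain` and `fine_tune` and all hyperparameters are hypothetical.

```python
import numpy as np


def pretrain(X, k, rng, steps=300, lr=0.1):
    """Stage 1: fit encoder W and decoder V on unlabeled X by reconstruction."""
    n, d = X.shape
    W = 0.1 * rng.standard_normal((d, k))  # encoder (stand-in for f_theta)
    V = 0.1 * rng.standard_normal((k, d))  # decoder (stand-in for q_xi)
    losses = []
    for _ in range(steps):
        Z = X @ W
        R = Z @ V - X                      # reconstruction residual
        losses.append(float(np.mean(R ** 2)))
        grad_V = 2.0 * Z.T @ R / R.size
        grad_W = 2.0 * X.T @ (R @ V.T) / R.size
        V -= lr * grad_V
        W -= lr * grad_W
    return W, losses


def fine_tune(X, y, W, steps=300, lr=0.5):
    """Stage 2: attach a logistic head to the encoder and train on labels."""
    W = W.copy()                           # start from pre-trained weights
    head = np.zeros(W.shape[1])
    bias = 0.0
    for _ in range(steps):
        Z = X @ W
        p = 1.0 / (1.0 + np.exp(-(Z @ head + bias)))
        err = (p - y) / len(y)             # d(mean cross-entropy)/d(logits)
        grad_head = Z.T @ err
        grad_W = X.T @ np.outer(err, head)
        head -= lr * grad_head
        bias -= lr * err.sum()
        W -= lr * grad_W
    return W, head, bias


rng = np.random.default_rng(0)
X_unlabeled = rng.standard_normal((500, 8))  # synthetic "traffic" features
X_labeled = rng.standard_normal((20, 8))     # small labeled subset
y_labeled = (X_labeled[:, 0] > 0).astype(float)

W_pre, losses = pretrain(X_unlabeled, k=3, rng=rng)
W_ft, head, bias = fine_tune(X_labeled, y_labeled, W_pre)
```

The decoder is discarded after the first stage and only the encoder weights carry over, mirroring the caption's description of connecting $f_\theta$ to a classification head. The labeled set here is deliberately much smaller than the unlabeled one to echo the few-shot setting.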