Always be Pre-Training: Representation Learning for Network Intrusion Detection with GNNs
Zhengyao Gu, Diego Troy Lopez, Lilas Alrahis, Ozgur Sinanoglu
TL;DR
This work tackles label dependency and information loss in GNN-based NIDS by introducing dense vector representations for mixed numeric and categorical features coupled with in-context self-supervised pre-training. The two-stage approach pre-trains on unlabeled network traffic from the same network and then fine-tunes with limited labeled data, enabling strong performance with minimal annotation. Empirical results on ToN-IoT and NF-UQ-NIDS-V2 show that SSL-pretrained dense representations achieve near-supervised performance in few-shot settings and substantially improve data efficiency, outperforming target-encoded baselines. The findings suggest a practical path toward scalable, annotation-efficient NIDS deployment with adaptable multi-class capabilities.
Abstract
Graph neural network-based network intrusion detection systems have recently demonstrated state-of-the-art performance on benchmark datasets. Nevertheless, these methods rely on target encoding for data pre-processing, which limits widespread adoption because it requires annotated labels, a cost-prohibitive requirement. In this work, we propose a solution that combines in-context pre-training with dense representations for categorical features to overcome this label dependency. Our approach exhibits remarkable data efficiency, achieving over 98% of the performance of the supervised state-of-the-art with less than 4% labeled data on the NF-UQ-NIDS-V2 dataset.
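To make the label-dependency point concrete, the following minimal sketch contrasts the two ways of encoding a categorical feature. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy `protocols` feature, and the embedding dimension are all hypothetical, and a real system would train the dense vectors with a deep learning framework rather than leave them at their random initial values.

```python
import random
from collections import defaultdict


def target_encode(categories, labels):
    """Replace each category with the mean label observed for it.

    Every category needs annotated examples, which is the label
    dependency described in the abstract.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


def init_dense_embeddings(categories, dim, seed=0):
    """Assign each category a dense vector without looking at any labels.

    Hypothetical stand-in for a trainable embedding table; the vectors
    would be learned during self-supervised pre-training.
    """
    rng = random.Random(seed)
    return {
        category: [rng.gauss(0.0, 0.1) for _ in range(dim)]
        for category in sorted(set(categories))
    }


# Toy data (hypothetical): one categorical feature and binary attack labels.
protocols = ["tcp", "udp", "tcp", "icmp"]
is_attack = [1, 0, 0, 1]

encoding = target_encode(protocols, is_attack)         # requires is_attack
embeddings = init_dense_embeddings(protocols, dim=4)   # requires no labels
```

The target encoding can only be computed once `is_attack` is available, whereas the embedding table is built from the unlabeled feature values alone, which is what allows pre-training on unlabeled traffic.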
