An Investigation into the Performance of Non-Contrastive Self-Supervised Learning Methods for Network Intrusion Detection

Hamed Fard, Tobias Schalau, Gerhard Wunder

TL;DR

This work tackles the challenge of limited labeled data in network intrusion detection by evaluating non-contrastive self-supervised learning (SSL) across multiple backbones and augmentation strategies. A two-stage, label-free pipeline learns normal-traffic representations with SSL (three encoders and six augmentations across five models) and detects anomalies via a K-means detector, evaluated on UNSW-NB15 and 5G-NIDD with 90 configurations. VICReg and Barlow Twins frequently yield top metrics, with Mixup (representation-space) and Gaussian Noise augmentation proving particularly effective on different datasets; however, autoencoder-based baselines can surpass non-contrastive SSL when properly tuned. The study highlights the critical roles of augmentation design and encoder choice in NIDS SSL, suggests that domain-specific augmentations and more advanced unsupervised detectors could further close the gap to reconstruction-based methods, and provides actionable insights for deploying label-efficient intrusion detection systems.

Abstract

Network intrusion detection, a well-explored cybersecurity field, has predominantly relied on supervised learning algorithms in the past two decades. However, their limitations in detecting only known anomalies prompt the exploration of alternative approaches. Motivated by the success of self-supervised learning in computer vision, there is a rising interest in adapting this paradigm for network intrusion detection. While prior research mainly delved into contrastive self-supervised methods, the efficacy of non-contrastive methods, in conjunction with encoder architectures serving as the representation learning backbone and augmentation strategies that determine what is learned, remains unclear for effective attack detection. This paper compares the performance of five non-contrastive self-supervised learning methods using three encoder architectures and six augmentation strategies. Ninety experiments are systematically conducted on two network intrusion detection datasets, UNSW-NB15 and 5G-NIDD. For each self-supervised model, the combination of encoder architecture and augmentation method yielding the highest average precision, recall, F1-score, and AUCROC is reported. Furthermore, by comparing the best-performing models to two unsupervised baselines, DeepSVDD, and an Autoencoder, we showcase the competitiveness of the non-contrastive methods for attack detection. Code at: https://github.com/renje4z335jh4/non_contrastive_SSL_NIDS

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Visualisation of different augmentation strategies. Swap Noise: each feature of sample $i$ is replaced, with probability $p$, by the feature at the same position in another sample. $m$ is a binary mask vector with elements drawn from a Bernoulli distribution, and $\odot$ represents element-wise multiplication. Zero Out Noise: to generate the augmented view $i'$, features of sample $i$ are multiplied element-wise by 1 minus the binary mask vector. Gaussian Noise: $\overrightarrow{\epsilon}$ is a vector of values, each sampled from the normal distribution. This vector is multiplied element-wise with a binary mask vector and added to the original sample vector $i$ to generate the augmented sample. Mixup: operates in the representation space, where the encoder $f_\theta$ receives the two copies of the sample and yields the representations $y = f_\theta(i)$ and $y' = f_{\theta}(i')$. Mixup creates a convex combination of $y$ and another randomly selected representation $y_j$ from the current batch. Similarly, the second representation $y'$ is augmented with a different randomly selected representation from the batch. Subsets: dataset features are split into $k$ subsets before being fed into the encoder $f_\theta$. Each subset can overlap with a neighboring subset by a defined percentage of features. For $k > 2$, more than two views are obtained. Each subset is processed by the encoder $f_\theta$ to generate representations $y_1, y_2, \ldots, y_k$. These representations are aggregated using their element-wise mean, forming the final representation for the downstream task.
  • Figure 2: Comparison of the different non-contrastive SSL models. The two augmented views $x$, $x'$ are fed to an encoder $f$ (which can be an MLP, CNN, or FT-T) with weights $\theta$, which yields the representations $y = f_{\theta}(x)$, $y' = f_{\theta}(x')$. Then, $y$ and $y'$ are further processed by the network $h$ with weights $\phi$. $h$ is an MLP (two fully-connected layers with batch normalization and ReLU activation). After this step, different criteria are applied to the projector embeddings $z$ and $z'$. VICReg: regularizes the variance and covariance of each branch independently with $v$ and $cov$, respectively. The invariance term is determined as the mean-squared distance between each pair of vectors $z$ and $z'$. The final loss is the weighted sum of these three terms. BYOL: one branch incorporates an additional predictor, denoted as $g$ with weights $\psi$, to map the output of one network to the other, resulting in an asymmetric architecture. The output embeddings of the two branches are feature-wise normalized (F-Norm) and the similarity loss is computed as the mean squared error (MSE) between them. Barlow Twins: its objective function assesses the cross-correlation matrix between the outputs of the two branches and has two terms: an invariance term ($inv$) that aims to set the diagonal elements of the cross-correlation matrix to 1 and a decorrelation term ($c$), which decorrelates pairs of different dimensions within the batch-wise normalized (B-Norm) embeddings. SimSiam: adds a predictor network in one branch and a stop-gradient operation in the other, omitting BYOL's moving average. W-MSE: applies batch slicing and a Cholesky decomposition-based whitening transformation to F-Norm embeddings. The loss is the MSE between whitened, normalized embeddings of the two branches.
  • Figure 3: K-means classifier generates an anomaly score ($as$) for a network traffic sample $i$.
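
The augmentation strategies described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration written only from the caption, not the authors' implementation (see the linked repository for that). All function names, the `rng` argument, the noise scale `sigma`, the mixing weight `lam`, and the exact way `feature_subsets` applies the overlap are assumptions made for this example.

```python
import random


def bernoulli_mask(n, p, rng):
    """Binary mask m: each element is 1 with probability p, else 0."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def swap_noise(batch, i, p, rng):
    """Replace masked features of sample i with the same-position
    feature of another, randomly chosen sample in the batch."""
    x = batch[i]
    mask = bernoulli_mask(len(x), p, rng)
    others = [j for j in range(len(batch)) if j != i]
    return [batch[rng.choice(others)][k] if m else x[k]
            for k, m in enumerate(mask)]


def zero_out_noise(x, p, rng):
    """i' = i * (1 - m), element-wise."""
    mask = bernoulli_mask(len(x), p, rng)
    return [xi * (1 - m) for xi, m in zip(x, mask)]


def gaussian_noise(x, p, sigma, rng):
    """i' = i + m * eps, element-wise, with eps ~ N(0, sigma^2).
    sigma is an assumed parameter; the caption does not give a value."""
    mask = bernoulli_mask(len(x), p, rng)
    return [xi + m * rng.gauss(0.0, sigma) for xi, m in zip(x, mask)]


def mixup(reps, i, lam, rng):
    """Convex combination lam * y_i + (1 - lam) * y_j in representation
    space, with y_j another randomly chosen representation of the batch.
    lam is an assumed mixing weight in [0, 1]."""
    j = rng.choice([k for k in range(len(reps)) if k != i])
    return [lam * a + (1 - lam) * b for a, b in zip(reps[i], reps[j])]


def feature_subsets(x, k, overlap):
    """Split features into k contiguous subsets. Each subset except the
    last extends into its right neighbor by `overlap` (a fraction of the
    base subset size). This overlap rule is an assumption."""
    base = len(x) // k
    extra = int(overlap * base)
    views = []
    for j in range(k):
        start = j * base
        stop = len(x) if j == k - 1 else min(len(x), start + base + extra)
        views.append(x[start:stop])
    return views


def mean_aggregate(reps):
    """Element-wise mean of the representations y_1, ..., y_k."""
    return [sum(col) / len(reps) for col in zip(*reps)]
```

Each input-space function takes a single feature vector (or a batch plus an index for Swap Noise) and returns one augmented view; calling it twice with the same `rng` yields the two views $x$ and $x'$ that the models in Figure 2 consume. Mixup and `mean_aggregate` act on encoder outputs rather than raw features, matching the caption's distinction between input-space and representation-space augmentation.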