Deep Learning for Network Anomaly Detection under Data Contamination: Evaluating Robustness and Mitigating Performance Degradation

D'Jeff K. Nkashama, Jordan Masakuna Félicien, Arian Soltani, Jean-Charles Verdier, Pierre-Martin Tardif, Marc Frappier, Froduald Kabanza

TL;DR

This study evaluates the robustness of six unsupervised DL algorithms against data contamination using a proposed evaluation protocol and finds significant performance degradation. It then proposes an enhanced auto-encoder with a constrained latent representation that shows improved resistance to data contamination compared to existing methods, offering a promising direction for more robust NAD systems.

Abstract

Deep learning (DL) has emerged as a crucial tool in network anomaly detection (NAD) for cybersecurity. While DL models for anomaly detection excel at extracting features and learning patterns from data, they are vulnerable to data contamination -- the inadvertent inclusion of attack-related data in training sets presumed benign. This study evaluates the robustness of six unsupervised DL algorithms against data contamination using our proposed evaluation protocol. Results demonstrate significant performance degradation in state-of-the-art anomaly detection algorithms when exposed to contaminated data, highlighting the critical need for self-protection mechanisms in DL-based NAD models. To mitigate this vulnerability, we propose an enhanced auto-encoder with a constrained latent representation, allowing normal data to cluster more densely around a learnable center in the latent space. Our evaluation reveals that this approach exhibits improved resistance to data contamination compared to existing methods, offering a promising direction for more robust NAD systems.
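
The constrained latent representation described above can be illustrated with a small sketch. The paper's exact loss is not reproduced on this page, so the form below is an assumption: a squared reconstruction error plus a weight `lam` times the squared distance between the latent code `z` and a center `c`. The function names (`squared_distance`, `latent_regularized_loss`, `mean_center`) are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def latent_regularized_loss(
    x: Sequence[float],      # input sample
    x_rec: Sequence[float],  # reconstruction produced by the decoder
    z: Sequence[float],      # latent representation produced by the encoder
    c: Sequence[float],      # center in latent space (fixed or learned)
    lam: float,              # weight of the latent term; 0 gives a plain auto-encoder loss
) -> float:
    """ASSUMED loss form: reconstruction error + lam * distance of z to the center c."""
    return squared_distance(x, x_rec) + lam * squared_distance(z, c)


def mean_center(latents: Sequence[Sequence[float]]) -> List[float]:
    """Mean of a batch of latent codes, one possible non-learned choice of center."""
    n = len(latents)
    return [sum(column) / n for column in zip(*latents)]
```

With `lam = 0` the latent term vanishes and only the reconstruction error remains. A learnable center would instead treat `c` as a trainable parameter updated together with the network weights, which this sketch does not implement.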

Paper Structure (12 sections, 3 equations, 8 figures, 2 tables, 1 algorithm)

This paper contains 12 sections, 3 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Deep Auto-Encoder architecture, where $x$, $z$, and $x'$ denote the input data, the latent representation, and the reconstructed input, respectively.
  • Figure 2: Contour plot showing the decision boundary of DAE-LR utilizing our proposed loss function from Equation (\ref{eq:recerr_prime}), and trained on 2D synthetic contaminated data with different types of center $c$. (a) Standard DAE ($\lambda=0$). (b) DAE-LR mean-centered ($c=\bar{z}$). (c) DAE-LR with a learnable center.
  • Figure 3: ROC curves illustrating the performance of the latent-regulated DAE-LR (as described in Equation (\ref{eq:recerr_prime})), trained on contaminated 2D synthetic data. Setting $\lambda=0$ recovers the standard DAE.
  • Figure 4: Proposed evaluation protocol workflow for a single run. The model is trained with a contaminated set characterized by contamination ratio $\alpha$. The proportion of benign traffic and attack data in the test set is maintained constant across runs, regardless of the level of training set contamination. Moreover, the model threshold for the anomaly scoring function is computed on a separate validation set, distinct from the final test set. The final test set is excluded from the training process to prevent data leakage.
  • Figure 5: Visualization of normal traffic and attack data in a two-dimensional space using t-SNE \cite{van2008visualizing}. (a) and (b) show normal and attack data from the KDDCUP and NSL-KDD datasets, respectively, with all attack types combined into a single class. (c) displays normal data and specifically highlights the infiltration attack data from the CSE-CIC-IDS2018 dataset.
  • ...and 3 more figures
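
The single-run workflow in the Figure 4 caption can be sketched as a data-splitting routine. This is a minimal illustration under stated assumptions, not the authors' implementation: the contamination ratio `alpha` is taken to be the fraction of training samples that are attacks, the validation set is assumed to be benign-only, and the function name `make_splits` and all of its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[object, int]  # (features, label), label 1 = attack, 0 = benign


def make_splits(
    benign: Sequence[object],
    attacks: Sequence[object],
    alpha: float,        # ASSUMED meaning: fraction of the training set that is attack data
    n_train: int,
    n_val: int,
    n_test_benign: int,
    n_test_attack: int,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Return (train, val, test); the test composition does not depend on alpha."""
    rng = random.Random(seed)
    benign, attacks = list(benign), list(attacks)
    rng.shuffle(benign)
    rng.shuffle(attacks)

    # Reserve the test set first so it is never seen during training.
    test = [(x, 0) for x in benign[:n_test_benign]]
    test += [(x, 1) for x in attacks[:n_test_attack]]
    benign, attacks = benign[n_test_benign:], attacks[n_test_attack:]

    # Separate validation set, used only to choose the anomaly-score threshold
    # (benign-only is an assumption of this sketch).
    val = [(x, 0) for x in benign[:n_val]]
    benign = benign[n_val:]

    # Contaminated training set: labels are kept for bookkeeping only,
    # an unsupervised model would not see them.
    n_attack_train = round(alpha * n_train)
    n_benign_train = n_train - n_attack_train
    if n_benign_train > len(benign) or n_attack_train > len(attacks):
        raise ValueError("not enough samples for the requested split")
    train = [(x, 0) for x in benign[:n_benign_train]]
    train += [(x, 1) for x in attacks[:n_attack_train]]
    rng.shuffle(train)
    return train, val, test
```

Calling `make_splits` with the same seed and different values of `alpha` yields the same test set each time, so any change in test performance can be attributed to the training contamination rather than to a shifting test distribution.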