Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16

José Camacho; Katarzyna Wasielewska; Pablo Espinosa; Marta Fuentes-García

TL;DR

Data quality and preprocessing decisions can affect anomaly-detection performance on real network traffic more than the choice of ML technique. The authors build four variants of the UGR'16 benchmark that differ in training data period, flow directionality, and anonymization, and evaluate two distinct detectors (MSNM and OCSVM) on them, using ROC curves, AUC, and the U-Squared statistic to interpret the results. They find that dataset biases and labeling inaccuracies change performance more than the ML method does: training data contaminated by anomalies reduces detection, feature patterns vary with the reference dataset, and combining variants improves robustness. The work highlights the need for automatic data quality assessment in autonomous networks and offers a practical interpretive framework for diagnosing when data issues, rather than model capability, drive the observed performance.

Abstract

Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and techniques. However, ML can only be as good as the data it is fitted with, and data quality is an elusive concept that is difficult to assess. In this paper, we show that relatively minor modifications to a benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies. Our findings illustrate that the widely adopted approach of comparing a set of models in terms of performance results (e.g., in terms of accuracy or ROC curves) may lead to incorrect conclusions when done without a proper understanding of dataset biases and sensitivity. We contribute a methodology to interpret a model response that can be useful for this understanding.
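
As a rough illustration of why labelling inaccuracies make measured performance uncertain, the sketch below computes the AUC of a fixed set of anomaly scores twice: once against correct labels and once with a single background flow mislabelled as an attack. The scores, the labels, and the helper `roc_auc` are hypothetical toy values and names chosen for this example; they are not taken from UGR'16 or from the authors' code.

```python
from itertools import product


def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for six flows (higher = more anomalous).
scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]

# Assumed ground truth: the last two flows are attacks.
true_labels = [0, 0, 0, 0, 1, 1]

# Same flows, but the first background flow is wrongly labelled as an attack.
noisy_labels = [1, 0, 0, 0, 1, 1]

auc_true = roc_auc(scores, true_labels)
auc_noisy = roc_auc(scores, noisy_labels)
print(f"AUC with correct labels: {auc_true:.3f}")
print(f"AUC with one mislabelled flow: {auc_noisy:.3f}")
```

The detector's scores are identical in both runs, so the difference in AUC comes entirely from the labels. In this toy case a single wrong label is enough to move the measured AUC substantially.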

Paper Structure (4 sections, 5 figures, 1 table)

This paper contains 4 sections, 5 figures, 1 table.

Introduction
Materials and Methods
Experiments and results
Conclusions

Figures (5)

Figure 1: ROC curve (a) and attack-type based AUC results (b) for the data parsed from original unidirectional flows in UGR'16v1 and UGR'16v2, and for a variant of the latter with no IRC features (UGR'16v2 NoIRC).
Figure 2: Comparison of U-Squared statistics for the NERISBOTNET attack using as a reference UGR'16v1 (a) and UGR'16v2 (b).
Figure 3: Boxplots of selected features in background traffic (Negative) versus NERISBOTNET traffic (Positive).
Figure 4: ROC curve for the data parsed from anonymized bidirectional (UGR'16v3) and unidirectional (UGR'16v4) flows, and a combination of both (UGR'16v3v4).
Figure 5: Boxplot in background traffic (Negative) versus DOS traffic (Positive) of dport_telnet in UGR'16v3 (a) and of sport_telnet in UGR'16v4 (b).
