Interpretable Feature Learning in Multivariate Big Data Analysis for Network Monitoring

José Camacho, Katarzyna Wasielewska, Rasmus Bro, David Kotz

TL;DR

The paper addresses interpretability in data-driven network monitoring at Big Data scale by proposing an automatic feature-learning extension for Multivariate Big Data Analysis (MBDA). It introduces a learning algorithm (fclearner.py) that derives interpretable features via prevalence-based selection and integrates them into the MBDA pipeline (upstream, analysis, downstream) to enable scalable, interactive anomaly detection. Through two real-world case studies (UGR'16 NetFlow and Dartmouth Wi-Fi SNMP traps), it demonstrates that learned features improve anomaly detection and maintain human-readable explanations, while enabling significant data compression and efficient parallel processing. The work provides a practical, open-source Python toolchain that enhances observability and root-cause analysis in large-scale networks, and it highlights directions for improving sensitivity and grouping features for richer interpretations.

Abstract

There is an increasing interest in the development of new data-driven models useful to assess the performance of communication networks. For many applications, like network monitoring and troubleshooting, a data model is of little use if it cannot be interpreted by a human operator. In this paper, we present an extension of the Multivariate Big Data Analysis (MBDA) methodology, a recently proposed interpretable data analysis tool. In this extension, we propose a solution to the automatic derivation of features, a cornerstone step for the application of MBDA when the amount of data is massive. The resulting network monitoring approach allows us to detect and diagnose disparate network anomalies, with a data-analysis workflow that combines the advantages of interpretable and interactive models with the power of parallel processing. We apply the extended MBDA to two case studies: UGR'16, a benchmark flow-based real-traffic dataset for anomaly detection, and Dartmouth'18, the longest and largest Wi-Fi trace known to date.
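
To make the idea of prevalence-based feature derivation more concrete, the following is a minimal sketch, not the authors' fclearner.py. It assumes that "prevalence" means the fraction of records in which a given value of a variable appears, and that a value becomes a countable feature when that fraction reaches a threshold. The function names (`learn_features`, `count_features`), the record format (a list of dictionaries), and the threshold value are hypothetical and chosen only for illustration.

```python
from collections import Counter


def learn_features(records, variables, threshold=0.01):
    """Keep, per variable, the values whose prevalence reaches the threshold.

    Assumption: prevalence = (records containing the value) / (all records).
    This is an illustrative reading, not the published algorithm.
    """
    n = len(records)
    features = {}
    for var in variables:
        counts = Counter(rec[var] for rec in records if var in rec)
        # most_common() orders values from most to least frequent.
        features[var] = [v for v, c in counts.most_common() if c / n >= threshold]
    return features


def count_features(records, features):
    """Count how often each learned (variable, value) feature occurs in a batch."""
    counts = {(var, val): 0 for var, vals in features.items() for val in vals}
    for rec in records:
        for var in features:
            key = (var, rec.get(var))
            if key in counts:
                counts[key] += 1
    return counts


# Hypothetical flow records: two common destination ports and one rare one.
records = (
    [{"dst_port": 80}] * 60
    + [{"dst_port": 443}] * 39
    + [{"dst_port": 31337}] * 1
)

features = learn_features(records, ["dst_port"], threshold=0.05)
counts = count_features(records, features)
print(features)
print(counts)
```

Each learned feature remains readable by an operator because it is a plain "variable equals value" counter, and rare values fall below the threshold and are not tracked individually. How the actual tool handles rare values, numeric ranges, or feature grouping is not covered by this sketch.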

Paper Structure

This paper contains 28 sections, 3 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of a simple multivariate example.
  • Figure 2: Multivariate Big Data Analysis diagram: upstream phase, analysis phase and downstream phase. The first and last phases are performed in a cluster of computers or a powerful server. The analysis can be performed on a regular computer. Comic image from www.slon.pics SLON in Freepik.
  • Figure 3: ROC curves (a) and attack-type based AUC results (b) for a set of different solutions based on MBDA.
  • Figure 4: Profile of detection of NERISBOTNET attacks with MBDA Opt (a) and MBDA FL$_{0.01}$ (b) using oMEDA.
  • Figure 5: Boxplots and t-tests of selected features in background traffic (Negative) versus NERISBOTNET traffic (Positive).
  • ...and 6 more figures