Multiple-Input Variational Auto-Encoder for Anomaly Detection in Heterogeneous Data

Phai Vu Dinh, Diep N. Nguyen, Dinh Thai Hoang, Quang Uy Nguyen, Eryk Dutkiewicz

TL;DR

The paper addresses anomaly detection on non-IID, heterogeneous data by introducing MIAEAD to score anomalies at the feature-subset level with multiple sub-encoders, and MIVAE to model the normal-data distribution in a latent space across subsets. It proves that MIVAE yields greater separation between normal and anomalous samples than VAEAD and demonstrates superior AUC on eight real-world datasets, with up to 6% improvements over strong unsupervised baselines and generative models. The results show robustness to varying anomaly ratios and dataset heterogeneity (CV), and highlight parameter-efficiency for MIVAE/MIAEAD compared to their baselines. The work suggests practical applicability to multi-source data and outlines directions for extension to images, time series, and unknown sub-dataset partitioning.

Abstract

Anomaly detection (AD) plays a pivotal role in AI applications, e.g., in classification and in intrusion/threat detection in cybersecurity. However, most existing methods struggle with the heterogeneity amongst feature subsets that arises in non-independent and identically distributed (non-IID) data. To address this, we propose a novel neural network model called Multiple-Input Auto-Encoder for AD (MIAEAD). MIAEAD assigns an anomaly score to each feature subset of a data sample to indicate its likelihood of being an anomaly, using the reconstruction error of the corresponding sub-encoder as the score. All sub-encoders are trained simultaneously using unsupervised learning to determine the anomaly scores of the feature subsets. The AUC of MIAEAD is calculated for each sub-dataset, and the maximum AUC among the sub-datasets is taken as the final AUC. To leverage the ability of generative models to identify anomalies by modelling the distribution of normal data, we develop a novel neural network architecture called Multiple-Input Variational Auto-Encoder (MIVAE). MIVAE processes feature subsets through its sub-encoders before learning the distribution of normal data in the latent space. This allows MIVAE to identify anomalies that deviate from the learned distribution. We theoretically prove that the difference in the average anomaly score between normal samples and anomalies obtained by the proposed MIVAE is greater than that of the Variational Auto-Encoder (VAEAD), resulting in a higher AUC for MIVAE. Extensive experiments on eight real-world anomaly datasets demonstrate the superior performance of MIAEAD and MIVAE over conventional methods and state-of-the-art unsupervised models, by up to 6% in terms of AUC score. In addition, MIAEAD and MIVAE achieve a high AUC when applied to feature subsets with low heterogeneity, as measured by the coefficient of variation (CV) score.
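
The per-subset scoring and maximum-AUC selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The data are synthetic, the subset names `A` and `B` are hypothetical, and `fit_mean_reconstructor` is a deliberately trivial placeholder for a trained sub-encoder/decoder pair: it "reconstructs" every sample as the feature means. Only the scoring and selection logic is meant to mirror the description.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical split of 4 features into two feature subsets (sub-datasets).
SUBSETS = {"A": [0, 1], "B": [2, 3]}

def make_sample(anomalous):
    """Synthetic sample; anomalies only perturb the features of subset B."""
    row = [random.gauss(0, 1) for _ in range(4)]
    if anomalous:
        for c in SUBSETS["B"]:
            row[c] += 6.0
    return row

labels = [0] * 200 + [1] * 20          # 1 marks an anomaly (used for evaluation only)
X = [make_sample(y == 1) for y in labels]

def fit_mean_reconstructor(X, columns):
    """Placeholder for a trained sub-encoder/decoder: always outputs the feature means."""
    mu = [mean(row[c] for row in X) for c in columns]
    return lambda part: mu

def subset_scores(X, columns, reconstruct):
    """Anomaly score per sample: mean squared reconstruction error on one feature subset."""
    scores = []
    for row in X:
        part = [row[c] for c in columns]
        recon = reconstruct(part)
        scores.append(mean((a - b) ** 2 for a, b in zip(part, recon)))
    return scores

def auc(scores, labels):
    """Rank-based AUC: chance that a random anomaly scores above a random normal sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

aucs = {}
for name, cols in SUBSETS.items():
    recon = fit_mean_reconstructor(X, cols)   # fitted without labels (unsupervised)
    aucs[name] = auc(subset_scores(X, cols, recon), labels)

best = max(aucs, key=aucs.get)                # keep the subset with the maximum AUC
print(aucs, "selected:", best)
```

In this toy setting the anomalies are confined to subset `B`, so scoring each subset separately isolates them, whereas averaging the error over all features would dilute the signal with the unaffected features of subset `A`.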

Paper Structure (31 sections, 4 theorems, 34 equations, 9 figures, 16 tables)

This paper contains 31 sections, 4 theorems, 34 equations, 9 figures, 16 tables.

Key Result

Theorem 1

Assume that, among the $M$ sub-datasets of $\mathbf{X}$, $M'$ sub-datasets contain anomalies, indexed by $j \in \{1, \ldots, M'\}$, where $1 \leq M' \leq M$. The average anomaly score for each feature of normal samples in the dataset $\mathbf{X}$ is $\delta$, whilst that of anomalies is $E$. We will show that

Figures (9)

  • Figure 1: An example of a non-IID dataset collected from three different sources. Abnormal features are in black while normal ones are white.
  • Figure 2: MIAEAD architecture.
  • Figure 3: The proposed MIVAE architecture. MIVAE consists of multiple sub-encoders that simultaneously process data, including different feature subsets.
  • Figure 4: AUC obtained by MIAEAD and MIVAE for different numbers of sub-encoders.
  • Figure 5: Results of AEAD, MIAEAD, and MIVAE on the M5 dataset as the ratio of anomalies to normal samples varies.
  • ...and 4 more figures

Theorems & Definitions (16)

  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • proof
  • Lemma 3
  • proof
  • Remark 1
  • ...and 6 more