Multi-Aspect Mining and Anomaly Detection for Heterogeneous Tensor Streams

Soshi Kakio, Yasuko Matsubara, Ren Fujiwara, Yasushi Sakurai

TL;DR

HeteroComp addresses the challenge of online analysis of heterogeneous tensor streams containing both categorical and continuous attributes by introducing a component-based model that jointly captures latent groups and their temporal dynamics using Gaussian process priors. It avoids discretizing continuous attributes or timestamps, instead modeling their distributions nonparametrically via a logistic Gaussian process and GP-driven dynamics, enabling continuous density estimation and interpretable summaries. The framework supports incremental updates through collapsed Gibbs sampling and efficient GP approximations, and detects group anomalies with a chi-squared goodness-of-fit score across components and attributes. Empirical results on real datasets show superior group-anomaly detection accuracy and linear-time scalability, illustrating the method’s practical usefulness for cybersecurity, e-commerce analytics, and other multi-aspect streaming domains.
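As a rough illustration of what a chi-squared goodness-of-fit score measures, the sketch below computes the generic Pearson statistic on per-component record counts. It is not the paper's exact score, which is defined over components and attributes; the function name `pearson_chi2`, the expected shares, and the window counts are all hypothetical.

```python
def pearson_chi2(observed, expected_probs):
    """Generic Pearson statistic: sum over cells of (O - E)^2 / E.

    observed       -- record counts per component in the current window
    expected_probs -- expected share of each component (must sum to 1)
    """
    total = sum(observed)
    stat = 0.0
    for count, prob in zip(observed, expected_probs):
        expected = total * prob
        if expected <= 0:
            raise ValueError("expected counts must be positive")
        stat += (count - expected) ** 2 / expected
    return stat


# Hypothetical example: three components with expected shares 50/30/20.
expected_shares = [0.5, 0.3, 0.2]
normal_window = [50, 30, 20]   # matches the expected mix
burst_window = [20, 30, 50]    # mass shifts toward the third component

normal_score = pearson_chi2(normal_window, expected_shares)
burst_score = pearson_chi2(burst_window, expected_shares)
print(normal_score, burst_score)
```

A window whose component mix matches expectation scores near zero, while a window in which records pile up in one component scores high, which is the intuition behind flagging group anomalies from component counts.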

Abstract

Analysis and anomaly detection in event tensor streams consisting of timestamps and multiple attributes, such as communication logs (time, IP address, packet length), are essential tasks in data mining. While existing tensor decomposition and anomaly detection methods provide useful insights, they face two limitations. (i) They cannot handle heterogeneous tensor streams, which comprise both categorical attributes (e.g., IP address) and continuous attributes (e.g., packet length). They typically require either discretizing continuous attributes or treating categorical attributes as continuous, both of which distort the underlying statistical properties of the data. Furthermore, incorrect assumptions about the distribution family of continuous attributes often degrade the model's performance. (ii) They discretize timestamps and so fail to track the temporal dynamics of streams (e.g., trends, abnormal events), which makes them ineffective for detecting anomalies at the group level, referred to as 'group anomalies' (e.g., DoS attacks). To address these challenges, we propose HeteroComp, a method for continuously summarizing heterogeneous tensor streams into 'components' representing latent groups in each attribute and their temporal dynamics, and for detecting group anomalies. Our method employs Gaussian process priors to model the unknown distributions of continuous attributes and the temporal dynamics, directly estimating probability densities from data. The extracted components give a concise but effective summarization, enabling accurate group anomaly detection. Extensive experiments on real datasets demonstrate that HeteroComp outperforms state-of-the-art algorithms in group anomaly detection accuracy, and that its computational time does not depend on the data stream length.
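One standard way to turn a Gaussian-process-distributed latent function $f$ into a probability density is the logistic transform $p(x) = \exp(f(x)) / \int \exp(f(s))\,ds$. The sketch below shows only that transform on a fixed grid, with a hand-written latent function standing in for a GP sample; it does not reproduce the paper's inference procedure. The names `logistic_density` and `trapezoid_area`, the grid, and the latent function are all assumptions made for illustration.

```python
import math


def trapezoid_area(grid, values):
    """Integrate values over grid with the trapezoidal rule."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    )


def logistic_density(grid, latent):
    """Map latent values f(x) on a grid to a density proportional to exp(f(x))."""
    shift = max(latent)  # subtract the max for numerical stability
    unnormalized = [math.exp(f - shift) for f in latent]
    area = trapezoid_area(grid, unnormalized)
    return [u / area for u in unnormalized]


# Hypothetical latent function with two bumps, standing in for a GP draw.
grid = [i * 0.1 for i in range(101)]  # x in [0, 10]
latent = [
    math.log(math.exp(-(x - 3.0) ** 2) + 0.5 * math.exp(-(x - 7.0) ** 2) + 1e-12)
    for x in grid
]

density = logistic_density(grid, latent)
print(trapezoid_area(grid, density))  # close to 1 by construction
```

Because the transform exponentiates and normalizes, any real-valued latent function yields a valid density, so no distribution family has to be chosen for the continuous attribute in advance.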

Paper Structure (22 sections, 2 theorems, 13 equations, 8 figures, 4 tables, 1 algorithm)


Key Result

Lemma 1

$\text{score}(\mathcal{X}^{C})$ follows a chi-squared distribution with $K \left(\sum_{m_1=1}^{M_1} U_{m_1} + \sum_{m_2=1}^{M_2} G_{m_2} - M_1 - M_2 + 1\right) - 1$ degrees of freedom.
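To make the degrees-of-freedom expression concrete, here is a minimal sketch that evaluates it and derives an approximate rejection threshold. It assumes $K$ is the number of components and that $U_{m_1}$ and $G_{m_2}$ are per-attribute sizes for the $M_1$ categorical and $M_2$ continuous attributes; this excerpt does not define those symbols, so that reading is an assumption. The threshold uses the Wilson-Hilferty approximation to the chi-squared quantile rather than an exact quantile, and the function names and example sizes are hypothetical.

```python
import math
from statistics import NormalDist


def degrees_of_freedom(K, U, G):
    """K * (sum(U) + sum(G) - M1 - M2 + 1) - 1, with M1 = len(U), M2 = len(G)."""
    M1, M2 = len(U), len(G)
    return K * (sum(U) + sum(G) - M1 - M2 + 1) - 1


def chi2_critical_value(dof, alpha=0.01):
    """Approximate upper-alpha chi-squared quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


# Hypothetical sizes: 4 components, one categorical attribute with 100
# values, one continuous attribute represented with 20 cells.
dof = degrees_of_freedom(K=4, U=[100], G=[20])
threshold = chi2_critical_value(dof, alpha=0.01)
print(dof, threshold)
```

Under these assumptions, a score above `threshold` would be flagged as anomalous at significance level `alpha`; an exact quantile from a statistics library could be substituted for the approximation.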

Figures (8)

  • Figure 1: Modeling power of HeteroComp on the (#3) Edge-IIoT dataset. Our proposed method finds hidden components that represent different characteristics in both (a) a categorical attribute (source port) and (b) a continuous attribute (TCP segment length), and (c) the component weights exhibit significant changes when cyber-attacks occur.
  • Figure 2: Illustration of HeteroComp: Given a current tensor $\mathcal{X}^{C}$ consisting of one categorical attribute and one continuous attribute, (1) it assigns components to each record in $\mathcal{X}^{C}$ and updates the model parameters $\mathbf{A}^{(1)}, \mathbf{C}^{(1)}, \mathbf{B}$, and (2) it quickly and accurately detects group anomalies based on component counts.
  • Figure 3: Graphical model of HeteroComp.
  • Figure 4: Market analysis with HeteroComp on the (#6) Amazon Movie&TV dataset. (a) The characteristics of four components (Adventure, Kids, SF/Comedy, Western) in the categorical attribute (Title) and the continuous attribute (price in US dollars). (b) The component weights exhibit significant changes in relation to the film’s release.
  • Figure 5: Dynamics of $\mathbf{B}$ (above) and attack times (below) in the (#4) DDos2019 dataset.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 1: Model parameter set: $\Theta$
  • Definition 2: Stream Statistics: $\mathcal{S}$
  • Lemma 1: Proof in Appendix \ref{sec:proof_anomaly}
  • Lemma 2: Proof in Appendix \ref{sec:proof_complexity}