AMAD: AutoMasked Attention for Unsupervised Multivariate Time Series Anomaly Detection

Tiange Huang, Yongjun Li

TL;DR

AMAD tackles unsupervised multivariate time series anomaly detection by introducing AutoMask Attention, which models multi-scale correlations, and by fusing local and global representations through Attention Mixup. The framework is trained with a reconstruction objective plus a Max-Min strategy and a Local-Global contrastive loss to encourage specialized local and global features without labeled data. Empirical results on five public benchmarks show performance ranging from competitive to state-of-the-art, with strong recall on PSM and SMAP and robust behavior across diverse datasets. The work demonstrates a practical, label-free approach that generalizes to varied anomaly patterns and temporal scales, offering improved applicability for industrial and sensor networks.

Abstract

Unsupervised multivariate time series anomaly detection (UMTSAD) plays a critical role in various domains, including finance, networks, and sensor systems. In recent years, due to the outstanding performance of deep learning in general sequential tasks, many models have been specialized for deep UMTSAD tasks and have achieved impressive results, particularly those based on the Transformer and self-attention mechanisms. However, the sequence anomaly association assumptions underlying these models are often limited to specific predefined patterns and scenarios, such as concentrated or peak anomaly patterns. These limitations hinder their ability to generalize to diverse anomaly situations, especially where the lack of labels poses significant challenges. To address these issues, we propose AMAD, which integrates AutoMasked Attention for UMTSAD scenarios. AMAD introduces a novel structure based on the AutoMask mechanism and an attention mixup module, forming a simple yet generalized anomaly association representation framework. This framework is further enhanced by a Max-Min training strategy and a Local-Global contrastive learning approach. By combining multi-scale feature extraction with automatic relative association modeling, AMAD provides a robust and adaptable solution to UMTSAD challenges. Extensive experimental results demonstrate that the proposed model achieves performance competitive with SOTA benchmarks across a variety of datasets.

Paper Structure

This paper contains 20 sections, 21 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The experimental results of our method (orange bars) compared to five SOTA methods on three datasets. The results show that our method outperforms the others in most metrics.
  • Figure 2: Model architecture. (a) AutoMask block. Our proposed AutoMask block integrates sinusoidal relative positional information while receiving $Q$ (queries) and $K$ (keys), forming a module similar to Rotary Position Embedding (RoPE) that encodes relative positional information across multiple learnable frequencies $\omega$, thereby learning multiscale sequence information (MSI). It also retains classical Self-Attention as a global sequence information (GSI) learner. After the MSI and GSI are fused through the Mixup module, attention weighting is applied to $V$ (values). The block is trained using a sequence reconstruction task. To ensure that the two attention mechanisms learn different aspects of the sequence data, we adopt a Max-Min training strategy under the global task of sequence reconstruction. (b) The position of the AutoMask block within the entire model stack, whose overall architecture is consistent with that of the Transformer at the top level.
  • Figure 3: AutoMask Attention. The AutoMask Attention mechanism incorporates a learnable modulation mechanism, which we name AutoMask. This mechanism draws direct inspiration from the idea of Fourier decomposition, where a sufficient number of orthogonal trigonometric functions can combine to form any curve. In contrast to Rotary Position Embedding (RoPE; Su et al., 2024), AutoMask introduces multiple learnable trigonometric modulation terms with dynamic weights denoted by $\omega$. We refer to this as Automatic Masking. On top of this learnable automatic masking, we employ Max-Min and contrastive training strategies (explained in the main text) to make AutoMask more inclined to fit local features of the sequence. Each pair of components in the $Q/K$ vectors corresponds to a RoPE rotation angle $\theta$, embedding absolute positional information. After $l$ embeddings, the resulting vectors undergo a weighted linear combination through AutoMask, yielding the AutoMask-embedded vectors $\widetilde{Q}$ and $\widetilde{K}$.
  • Figure 4: The Max-Min strategy. The strategy steers AutoMask Attention to primarily represent local features, while Self Attention focuses on the global characteristics of the sequence. This is achieved by constructing a prior mean distribution, based on the Cross-Attention Divergence defined using JS divergence, as an anchor point (the green circular area). Both Attention outputs are intermediate logits. The minimization step updates only the weights of the AutoMask Attention sub-module, and the maximization step updates only the weights of the Self Attention sub-module. By reducing the correlation between the AutoMask Attention logits and the intermediate distribution and increasing the correlation between the Self Attention logits and the intermediate distribution, AutoMask Attention is expected to focus more on local features of the sequence (the purple part), while Self Attention focuses more on the overall features of the sequence (the light yellow part).
  • Figure 5: Key Code for Contrastive Alignment Strategy
  • ...and 3 more figures
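
The captions for Figures 2 to 4 describe three pieces: multi-frequency rotary embedding of $Q$ and $K$, a mixup of the local and global attention maps, and a JS divergence between the two maps. The sketch below illustrates one possible reading of those pieces in NumPy; it is not the authors' implementation. The function names, the fixed mixing coefficient `lam`, the random initialisation of `omegas` and `weights`, and the choice to compute the divergence row-wise on softmax outputs are all assumptions made for illustration. No training loop is shown, so nothing here is learned and the Max-Min updates themselves are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rotate_pairs(x, omega):
    # RoPE-style rotation: at position t, feature pair p is rotated by t * omega[p].
    T, d = x.shape
    angles = np.arange(T)[:, None] * omega[None, :]  # shape (T, d // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def automask_embed(x, omegas, weights):
    # Assumed form: weighted linear combination of l rotated copies of x.
    return sum(w * rotate_pairs(x, om) for w, om in zip(weights, omegas))

def js_divergence(p, q, eps=1e-12):
    # Row-wise Jensen-Shannon divergence between two attention maps.
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def automask_block(Q, K, V, omegas, weights, lam=0.5):
    d = Q.shape[-1]
    Qt = automask_embed(Q, omegas, weights)
    Kt = automask_embed(K, omegas, weights)
    local_attn = softmax(Qt @ Kt.T / np.sqrt(d))   # multiscale branch (MSI)
    global_attn = softmax(Q @ K.T / np.sqrt(d))    # classical self-attention (GSI)
    mixed = lam * local_attn + (1.0 - lam) * global_attn  # assumed convex mixup
    return mixed @ V, local_attn, global_attn

# Toy usage with random inputs (hypothetical sizes).
rng = np.random.default_rng(0)
T, d, l = 16, 8, 3
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
omegas = rng.uniform(0.01, 1.0, size=(l, d // 2))  # stand-ins for learnable frequencies
weights = softmax(rng.standard_normal(l))          # stand-ins for learnable weights
out, local_attn, global_attn = automask_block(Q, K, V, omegas, weights)
divergence = js_divergence(local_attn, global_attn)
print(out.shape, divergence.shape)
```

In a trainable version, `omegas`, `weights`, and the projections producing `Q`, `K`, and `V` would be parameters, and a divergence of this kind is the sort of quantity the Max-Min strategy could minimize for one branch and maximize for the other.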