Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction

Shiyan Hu, Jianxin Jin, Yang Shu, Peng Chen, Bin Yang, Chenjuan Guo

Abstract

Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.

Paper Structure (33 sections, 1 theorem, 16 equations, 11 figures, 31 tables)

This paper contains 33 sections, 1 theorem, 16 equations, 11 figures, 31 tables.

Key Result

Lemma 1

For the mutual information $I(\mathbf{Z}_\mathrm{text}; \mathbf{Z}_\mathrm{con})$, there exists the following tight upper bound that can approximate its value: where $\mathrm{KL}(\cdot)$ denotes the Kullback–Leibler (KL) divergence, defined as $\mathrm{KL}(\mathbb{P}(\mathrm{x})\,\|\,\mathbb{G}(\mathrm{x})) = \sum_{\mathrm{x}}\mathbb{P}(\mathrm{x})\log \frac{\mathbb{P}(\mathrm{x})}{\mathbb{G}(\mathrm{x})}$, …
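
The bound itself is not reproduced in the statement above. As an illustration only, a standard variational bound of this kind (an assumption here, not necessarily the paper's exact lemma) is $I(\mathbf{Z}_\mathrm{text}; \mathbf{Z}_\mathrm{con}) \le \mathbb{E}_{\mathbf{Z}_\mathrm{text}}\big[\mathrm{KL}(\mathbb{P}(\mathbf{Z}_\mathrm{con} \mid \mathbf{Z}_\mathrm{text})\,\|\,\mathbb{G}(\mathbf{Z}_\mathrm{con}))\big]$, with equality when $\mathbb{G}$ equals the true marginal of $\mathbf{Z}_\mathrm{con}$. The sketch below checks this numerically on a small discrete joint distribution; the toy table and all function names are hypothetical.

```python
import numpy as np

def kl(p, g):
    """Discrete KL(P || G) = sum_x P(x) log(P(x) / G(x)); terms with P(x) = 0 contribute 0."""
    p, g = np.asarray(p, dtype=float), np.asarray(g, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / g[mask])))

def mutual_information(joint):
    """I(A; B) for a joint probability table joint[a, b]."""
    pa = joint.sum(axis=1, keepdims=True)  # marginal of A, shape (|A|, 1)
    pb = joint.sum(axis=0, keepdims=True)  # marginal of B, shape (1, |B|)
    return kl(joint.ravel(), (pa * pb).ravel())

def variational_bound(joint, g):
    """E_a[ KL(P(b | a) || G(b)) ] for an arbitrary reference distribution G over B."""
    pa = joint.sum(axis=1)
    return float(sum(pa[a] * kl(joint[a] / pa[a], g)
                     for a in range(len(pa)) if pa[a] > 0))

# Toy joint distribution (rows: values of A, columns: values of B). Assumed for illustration.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

mi = mutual_information(joint)
loose = variational_bound(joint, np.array([0.5, 0.5]))  # G is uniform, so the bound is loose
tight = variational_bound(joint, joint.sum(axis=0))     # G is the true marginal, so the bound is tight
print(f"I = {mi:.4f}, bound (uniform G) = {loose:.4f}, bound (marginal G) = {tight:.4f}")
```

The gap between the bound and the mutual information equals $\mathrm{KL}$ between the true marginal and $\mathbb{G}$, which is why choosing $\mathbb{G}$ as the marginal closes it.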

Figures (11)

  • Figure 1: (a) LLM-based methods generate endogenous text from time series without incorporating exogenous information. (b) Exogenous-based methods incorporate text information by retrieving background knowledge from the web. The absence of connecting lines indicates that the two modalities are not aligned. (c) MindTS employs cross-view fusion to ensure semantic consistency between the exogenous text and the time series, enabling more precise alignment across modalities.
  • Figure 2: MindTS overview. Given an input time series $\textbf{X}$, we first apply instance normalization and patching, then encode the patches using a time encoder. (a) Each patch generates its corresponding endogenous text $\textbf{O}$. Along with the input exogenous text $\textbf{C}$, both views are encoded and (b) fused via cross-view fusion to obtain fused text representations $\textbf{Z}_\text{text}$. Time and text representations are then semantically aligned via a multimodal alignment layer. (c) To mitigate textual redundancy, the aligned text is compressed using a content condenser. Finally, (d) the condensed text $\textbf{Z}_\text{con}$ is used to reconstruct the masked time series, enhancing cross-modal interaction.
  • Figure 3: Ablation studies for MindTS, with the highest metrics highlighted in dark-colored bars.
  • Figure 4: Results of the sensitivity analysis. The vertical coordinate shows the Aff-F score, with higher scores representing better performance. The dark line represents the mean of 5 experiments, and the light area represents the range.
  • Figure 5: Visualization comparisons of anomaly scores from MindTS for all datasets.
  • ...and 6 more figures
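
The first step named in the Figure 2 caption, instance normalization followed by patching, can be sketched as below. This is a minimal NumPy illustration and not the MindTS implementation: the function names, the patch length and stride, and the choice to drop a trailing remainder are all assumptions.

```python
import numpy as np

def instance_normalize(x, eps=1e-5):
    """Normalize one series of shape (T, D) to zero mean and unit variance per variable."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

def make_patches(x, patch_len, stride):
    """Split a (T, D) series into patches of shape (N, patch_len, D).

    Any trailing time steps that do not fill a whole patch are dropped (an assumption).
    """
    num_steps = x.shape[0]
    starts = range(0, num_steps - patch_len + 1, stride)
    return np.stack([x[s:s + patch_len] for s in starts])

# Hypothetical input: 96 time steps, 3 variables.
rng = np.random.default_rng(0)
series = rng.normal(loc=5.0, scale=2.0, size=(96, 3))

normalized = instance_normalize(series)
patches = make_patches(normalized, patch_len=16, stride=16)
print(patches.shape)  # (6, 16, 3): six non-overlapping patches
```

Each patch would then be passed to the time encoder, and, per the caption, paired with its own endogenous text.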

Theorems & Definitions (2)

  • Lemma 1
  • proof