VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection

Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

TL;DR

VETime tackles the dual challenge of point and context anomalies in time-series data under zero-shot settings by unifying temporal and visual representations. It introduces a four-part pipeline comprising Reversible Image Conversion, Patch-Level Temporal Alignment, Anomaly Window Contrastive Learning, and Task-Adaptive Multi-Modal Fusion to enable fine-grained cross-modal interaction and dynamic fusion. Empirical results across 11 univariate datasets show VETime achieves state-of-the-art zero-shot performance with superior localization accuracy and substantially lower computational cost than vision-based approaches. The framework demonstrates strong generalization, supports multivariate extensions, and offers practical potential for robust, parameter-efficient TSAD in diverse domains.

Abstract

Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.
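
The abstract does not spell out how Reversible Image Conversion works, so the following is only a minimal sketch of the general idea of a "shared visual-temporal timeline", not the authors' method. It assumes a simple row-major fold of a min-max normalized series into a grayscale grid. The function names (`series_to_image`, `image_to_series`, `pixel_to_time`) and the `width` parameter are hypothetical and chosen purely for illustration.

```python
import numpy as np

def series_to_image(x, width):
    """Fold a 1D series into a 2D grid (assumed layout, not the paper's)."""
    x = np.asarray(x, dtype=np.float64)
    lo, hi = float(x.min()), float(x.max())
    scale = hi - lo if hi > lo else 1.0          # avoid division by zero
    pad = (-len(x)) % width                      # pad up to a full last row
    padded = np.pad((x - lo) / scale, (0, pad), mode="edge")
    image = padded.reshape(-1, width)            # row-major: time runs left to right
    meta = {"lo": lo, "scale": scale, "length": len(x)}
    return image, meta

def image_to_series(image, meta):
    """Invert series_to_image using the stored normalization metadata."""
    flat = image.reshape(-1)[: meta["length"]]   # drop the padding
    return flat * meta["scale"] + meta["lo"]

def pixel_to_time(row, col, width):
    """Map a pixel position back to its time index under the row-major fold."""
    return row * width + col

# Usage: a sine wave with one injected spike at t = 100.
x = np.sin(np.linspace(0, 20, 250))
x[100] = 5.0
image, meta = series_to_image(x, width=16)
recovered = image_to_series(image, meta)
```

Because the fold keeps the normalization constants and the original length, the series can be recovered from the image, and every pixel corresponds to exactly one time step. That one-to-one mapping is the property a pointwise detector needs in order to localize an anomaly seen in the image.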

Paper Structure (55 sections, 13 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Comparison of the previous TSAD methods and the proposed VETime.
  • Figure 2: Overview of the proposed framework. The time series is processed by a time-series encoder to extract temporal features $F_{TS}$ and, in parallel, undergoes Reversible Image Conversion to generate the visual input. Visual features $F_{V_0}$, extracted by a frozen pre-trained image encoder, are transformed into $F_V$ through Patch-Level Temporal Alignment to reinforce their association with temporal positions. $F_{TS}$ and $F_V$ are then fed into Anomaly Window Contrastive Learning to derive an anomaly-enhanced representation $F_A$. Finally, a Task-Adaptive Multi-Modal Fusion module integrates all features ($F_A$, $F_{TS}$, $F_V$). The final outputs $F_{AD}$ and $F_{Rec}$ are mapped to the original sequence length via token projection for the anomaly classification and reconstruction heads, respectively.
  • Figure 3: The framework of Reversible Image Conversion and Patch-Level Temporal Alignment.
  • Figure 4: An illustration of Anomaly Window Contrastive Learning.
  • Figure 5: Hyperparameter analysis of $\lambda_{aw}$, $\lambda_{e}$, and $\tau$ in terms of VUS-PR (%).
  • ...and 5 more figures