Harnessing Vision-Language Models for Time Series Anomaly Detection

Zelin He, Sarah Alnegheimish, Matthew Reimherr

TL;DR

The paper tackles time-series anomaly detection (TSAD) by using vision-language models to capture visual-temporal context without domain-specific training. It transforms time series into 2-D line-plot images and proposes a two-stage framework: ViT4TS for high-resolution localization, followed by VLM4TS for global-context verification. The method achieves state-of-the-art average F1-max on 9 of 11 benchmarks and is on average 36x more efficient in token usage than existing language-model-based TSAD methods, demonstrating strong generalization and practical efficiency. This approach enables scalable, zero-shot TSAD across diverse domains, with potential extensions to multivariate data and adaptive patch sizing.

Abstract

Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and sensor-based condition monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal understanding capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual understanding tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pre-trained vision encoder, which leverages 2D time series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM's visual understanding capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases, yielding a 24.6% improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language model-based TSAD methods and is on average 36x more efficient in token usage.
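
The headline metric above, F1-max, can be illustrated with a small sketch. It assumes a point-wise definition: every timestamp gets an anomaly score, a threshold turns scores into binary predictions, and F1-max is the best F1 over all candidate thresholds. The paper may score overlapping intervals rather than individual points, so the function names `f1_score` and `f1_max` and the point-wise matching are illustrative assumptions, not the authors' evaluation code.

```python
def f1_score(pred, truth):
    """Point-wise F1 between two equal-length sequences of 0/1 (or bool) labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_max(scores, truth):
    """Best F1 over thresholds drawn from the observed scores (assumed definition)."""
    best = 0.0
    for thr in sorted(set(scores)):
        # A point is flagged as anomalous when its score reaches the threshold.
        pred = [s >= thr for s in scores]
        best = max(best, f1_score(pred, truth))
    return best


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.9, 0.8, 0.3]
    truth = [0, 0, 1, 1, 0]
    print(f1_max(scores, truth))
```

Because the threshold is chosen after seeing the labels, F1-max measures how well the scores rank anomalies and does not reflect a deployed threshold.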

Paper Structure

This paper contains 28 sections, 8 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: (Left) max F1 score averaged over all benchmarks, comparing VLM4TS to the best time-series-pretrained and from-scratch baselines. Error bars indicate standard deviation across datasets. (Right) max F1 score and token usage of VLM4TS versus language-model-based baselines.
  • Figure 2: (a) Presenting time series as textual input for LLMs can obscure real anomalies and increase incorrect detection, whereas visualizing time series as line plots makes contextual anomalies, such as distortions, readily apparent. (b) The resolution–context dilemma in VLM-based TSAD: a global plot preserves long-range context but compresses details for detection (left), while small-window views maintain high resolution but provide limited context and incur high token cost (right).
  • Figure 3: Overview of ViT4TS/VLM4TS (upper/lower pane). In ViT4TS, the raw time series is sliced into windows, and each window is transformed into an image and then embedded into multi-scale feature vectors. By comparing each of these features to others, ViT4TS localizes potentially anomalous regions and outputs a set of candidate anomaly intervals. In VLM4TS, a VLM is then prompted to integrate global temporal context to refine the detection.
  • Figure 4: F1-max score and elapsed (wall-clock) time for ViT4TS with different backbones. Error bars indicate standard deviation across 3 replications.
  • Figure 5: Qualitative results on MSL C-1 (top) and SMAP A-4 (bottom), illustrating how VLM4TS refines the initial local anomaly proposals from ViT4TS by selectively keeping, discarding, or adding intervals based on global temporal understanding. "VLM-Long" refers to VLM prompting on the full-series image (ablation without ViT4TS screening); "VLM-Short" (TAMA) refers to VLM prompting on the rolling-windows image. Only representative segments are shown for clarity.
  • ...and 3 more figures
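
The screening flow described in the Figure 3 caption (slice into windows, render each window as an image, embed it, compare features across windows) can be sketched in plain Python. This is a toy illustration under loud assumptions, not the authors' implementation: `rasterize` draws a crude binary grid in place of a real line plot, `embed` simply flattens pixels where the paper uses a pre-trained vision encoder with multi-scale features, and scoring each window by its distance to the element-wise median feature is only one possible way to "compare each feature to others". All function names and parameters are hypothetical.

```python
from statistics import median


def slice_windows(series, window, stride):
    """Return (start, values) pairs for each sliding window."""
    return [(s, series[s:s + window])
            for s in range(0, len(series) - window + 1, stride)]


def rasterize(values, lo, hi, height=16):
    """Render a window as a height x len(values) binary grid (toy line plot)."""
    span = (hi - lo) or 1.0
    grid = [[0.0] * len(values) for _ in range(height)]
    for col, v in enumerate(values):
        row = int((v - lo) / span * (height - 1))
        grid[height - 1 - row][col] = 1.0  # row 0 is the top of the image
    return grid


def embed(grid):
    """Hypothetical stand-in for a pre-trained vision encoder: flatten pixels."""
    return [p for row in grid for p in row]


def score_windows(series, window=8, stride=4):
    """Score each window by its distance to the median feature (assumed rule)."""
    lo, hi = min(series), max(series)  # shared y-axis across all windows
    windows = slice_windows(series, window, stride)
    feats = [embed(rasterize(vals, lo, hi)) for _, vals in windows]
    ref = [median(col) for col in zip(*feats)]
    scores = []
    for (start, _), feat in zip(windows, feats):
        dist = sum((a - b) ** 2 for a, b in zip(feat, ref)) ** 0.5
        scores.append((start, start + window, dist))
    return scores


def candidate_intervals(scores, top_k=1):
    """Keep the top_k highest-scoring windows as candidate anomaly intervals."""
    return sorted(scores, key=lambda t: t[2], reverse=True)[:top_k]


if __name__ == "__main__":
    series = [0.0] * 40
    series[20] = 10.0  # inject a single spike
    for start, end, dist in candidate_intervals(score_windows(series), top_k=2):
        print(start, end, round(dist, 3))
```

In the two-stage design, candidates like these would then go to the VLM together with a global view of the series, which decides which intervals to keep, discard, or add.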