Harnessing Vision-Language Models for Time Series Anomaly Detection
Zelin He, Sarah Alnegheimish, Matthew Reimherr
TL;DR
The paper tackles time-series anomaly detection (TSAD) by using vision-language models to capture visual-temporal context without domain-specific training. It proposes a two-stage framework that transforms time series into 2-D line-plot images: ViT4TS for high-resolution localization, followed by VLM4TS for global-context verification. The method achieves state-of-the-art average F1-max on 9 of 11 benchmarks and is on average 36x more efficient in token usage than existing language-model-based TSAD methods, demonstrating strong generalization and practical efficiency. This approach enables scalable, zero-shot TSAD across diverse domains, with potential extensions to multivariate data and adaptive patch sizing.
Abstract
Time-series anomaly detection (TSAD) plays a vital role in a variety of fields, including healthcare, finance, and sensor-based condition monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal understanding that human experts rely on to identify contextual anomalies. To fill this gap, we explore a solution based on vision-language models (VLMs). Recent studies have shown that VLMs are capable at visual understanding tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution: (1) ViT4TS, a vision-screening stage built on a relatively lightweight pre-trained vision encoder, which leverages 2-D time-series representations to accurately localize candidate anomalies; and (2) VLM4TS, a VLM-based stage that integrates global temporal context and the VLM's visual understanding capacity to refine detection based on the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases, yielding a 24.6% improvement in F1-max score over the best baseline. Moreover, VLM4TS consistently outperforms existing language-model-based TSAD methods and is on average 36x more efficient in token usage.
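To make the "2-D time series representation" idea concrete, the sketch below cuts a series into sliding windows and rasterizes each window into a small binary line-plot image. It is only an illustration under stated assumptions, not the paper's rendering pipeline: the helper names `sliding_windows` and `render_line_plot`, the window length, stride, image height, and the choice of a shared y-axis range across windows are all hypothetical, and the vision encoder that would consume the images is omitted.

```python
import math


def sliding_windows(series, window, stride):
    """Split a 1-D sequence into overlapping windows as (start_index, values)."""
    last_start = len(series) - window
    return [(s, series[s:s + window]) for s in range(0, last_start + 1, stride)]


def render_line_plot(values, height=32, lo=None, hi=None):
    """Rasterize values into a height x len(values) binary image.

    Row 0 is the top of the plot; a pixel is 1 where the line passes.
    lo/hi fix the y-axis range (assumption: shared across windows so that
    amplitudes stay comparable between images).
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1.0  # avoid division by zero on flat windows
    rows = []
    for v in values:
        row = round((hi - v) / span * (height - 1))
        rows.append(min(height - 1, max(0, row)))  # clamp out-of-range values
    image = [[0] * len(values) for _ in range(height)]
    prev = rows[0]
    for col, row in enumerate(rows):
        # Fill the vertical run between consecutive points so the line is connected.
        for r in range(min(prev, row), max(prev, row) + 1):
            image[r][col] = 1
        prev = row
    return image


# Synthetic example: a sine wave with one injected spike at t = 120.
series = [math.sin(t / 5.0) for t in range(200)]
series[120] = 3.0
lo, hi = min(series), max(series)
images = [
    (start, render_line_plot(w, height=32, lo=lo, hi=hi))
    for start, w in sliding_windows(series, window=64, stride=32)
]
```

In this toy setup the spike reaches the top row of every window that contains it, which is the kind of visually obvious deviation an image-based screening stage could pick up. A real pipeline would more likely render plots with a plotting library and feed them to a pre-trained vision encoder.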
