ViTs: Teaching Machines to See Time Series Anomalies Like Human Experts
Zexin Wang, Changhua Pei, Yang Liu, Hengyue Jiang, Quan Zhou, Haotian Si, Hang Cui, Jianhui Li, Gaogang Xie, Jingjing Li, Dan Pei
TL;DR
This work tackles the challenge of "train once, infer across scenarios" in time-series anomaly detection (TSAD) by introducing ViTs, a time-series vision-language model (TS-VLM) that converts time-series curves into images and uses proportional time-series image rescaling to handle arbitrarily long sequences. It combines an STL-based data generator with a three-stage Chain-of-TS fine-tuning pipeline (time-series knowledge injection, anomaly detection, and anomaly reasoning via RL), and it trains on fixed-length inputs while adapting to variable lengths at inference. Empirical results show that ViTs significantly outperforms state-of-the-art TSAD methods on synthetic data and achieves strong zero-shot performance on public datasets (average F1 around 0.8), supporting its generalization and its practical value for KPI monitoring. By aligning visual representations with human-centric anomaly reasoning, the approach offers a scalable, interpretable, and data-efficient path for TSAD, with code and data slated for public release.
Abstract
Web service administrators must ensure the stability of multiple systems by promptly detecting anomalies in Key Performance Indicators (KPIs). Achieving the goal of "train once, infer across scenarios" remains a fundamental challenge for time series anomaly detection models. Beyond improving zero-shot generalization, such models must also flexibly handle sequences of varying lengths during inference, ranging from one hour to one week, without retraining. Conventional approaches rely on sliding-window encoding and self-supervised learning, which restrict inference to fixed-length inputs. Large Language Models (LLMs) have demonstrated remarkable zero-shot capabilities across general domains. However, when applied to time series data, they face inherent limitations due to context length. To address this issue, we propose ViTs, a Vision-Language Model (VLM)-based framework that converts time series curves into visual representations. By rescaling time series images, temporal dependencies are preserved while maintaining a consistent input size, thereby enabling efficient processing of arbitrarily long sequences without context constraints. Training VLMs for this purpose introduces unique challenges, primarily due to the scarcity of aligned time series image-text data. To overcome this, we employ an evolutionary algorithm to automatically generate thousands of high-quality image-text pairs and design a three-stage training pipeline consisting of: (1) time series knowledge injection, (2) anomaly detection enhancement, and (3) anomaly reasoning refinement. Extensive experiments demonstrate that ViTs substantially enhance the ability of VLMs to understand and detect anomalies in time series data. All datasets and code will be publicly released at: https://anonymous.4open.science/r/ViTs-C484/.
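The idea that rescaling keeps the input size constant regardless of sequence length can be illustrated with a minimal sketch. This is not the authors' rendering pipeline: the function name `render_series`, the grid dimensions, and the min-max envelope per pixel column are all assumptions chosen for illustration. Each pixel column covers a proportional slice of the series, so a one-hour and a one-week series both map to the same fixed-size grid, and a single spike still leaves a visible mark.

```python
import math


def render_series(values, width=64, height=32):
    """Rasterize a series of any length onto a fixed width x height binary grid.

    Hypothetical helper for illustration; not taken from the ViTs code base.
    """
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat series

    def to_row(v):
        # Row 0 is the top of the image, so larger values map to smaller rows.
        return (height - 1) - round((v - lo) / span * (height - 1))

    grid = [[0] * width for _ in range(height)]
    for x in range(width):
        # Proportional slice of samples covered by this pixel column
        # (always at least one sample, even when n < width).
        start = x * n // width
        end = max(start + 1, (x + 1) * n // width)
        chunk = values[start:end]
        # Fill the column between the slice's max and min (min-max envelope),
        # so short spikes are not averaged away.
        top, bottom = to_row(max(chunk)), to_row(min(chunk))
        for y in range(top, bottom + 1):
            grid[y][x] = 1
    return grid


# One hour and one week of per-minute samples yield grids of the same size.
hour = [math.sin(i / 10) for i in range(60)]
week = [math.sin(i / 1000) for i in range(7 * 24 * 60)]
for series in (hour, week):
    img = render_series(series)
    print(len(img), len(img[0]))
```

The min-max envelope is one possible design choice here: plain averaging per column would shrink a one-sample spike in a week-long series to almost nothing, whereas keeping the extremes preserves the visual cue an operator would look for.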
