ViTs: Teaching Machines to See Time Series Anomalies Like Human Experts

Zexin Wang, Changhua Pei, Yang Liu, Hengyue Jiang, Quan Zhou, Haotian Si, Hang Cui, Jianhui Li, Gaogang Xie, Jingjing Li, Dan Pei

TL;DR

This work addresses the "train once, infer across scenarios" challenge in time-series anomaly detection (TSAD) by introducing ViTs, a time-series vision-language model (TS-VLM) that converts time-series curves into images and rescales them proportionally so that arbitrarily long sequences fit a consistent input size. It combines an STL-based data generator with a three-stage Chain-of-TS fine-tuning pipeline (time-series knowledge injection, anomaly detection, and anomaly reasoning via RL) and trains on fixed-length inputs while inferring on adaptive-length ones. Empirically, ViTs significantly outperforms state-of-the-art TSAD methods on synthetic data and achieves strong zero-shot performance on public datasets (average F1 around 0.8), which supports its generalization and its practical value for KPI monitoring. By aligning visual representations with human-centric anomaly reasoning, the approach offers a scalable, interpretable, and data-efficient path for TSAD; the authors state that code and data will be publicly released.

Abstract

Web service administrators must ensure the stability of multiple systems by promptly detecting anomalies in Key Performance Indicators (KPIs). Achieving the goal of "train once, infer across scenarios" remains a fundamental challenge for time series anomaly detection models. Beyond improving zero-shot generalization, such models must also flexibly handle sequences of varying lengths during inference, ranging from one hour to one week, without retraining. Conventional approaches rely on sliding-window encoding and self-supervised learning, which restrict inference to fixed-length inputs. Large Language Models (LLMs) have demonstrated remarkable zero-shot capabilities across general domains. However, when applied to time series data, they face inherent limitations due to context length. To address this issue, we propose ViTs, a Vision-Language Model (VLM)-based framework that converts time series curves into visual representations. By rescaling time series images, temporal dependencies are preserved while maintaining a consistent input size, thereby enabling efficient processing of arbitrarily long sequences without context constraints. Training VLMs for this purpose introduces unique challenges, primarily due to the scarcity of aligned time series image-text data. To overcome this, we employ an evolutionary algorithm to automatically generate thousands of high-quality image-text pairs and design a three-stage training pipeline consisting of: (1) time series knowledge injection, (2) anomaly detection enhancement, and (3) anomaly reasoning refinement. Extensive experiments demonstrate that ViTs substantially enhance the ability of VLMs to understand and detect anomalies in time series data. All datasets and code will be publicly released at: https://anonymous.4open.science/r/ViTs-C484/.
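The core idea of rescaling a curve to a consistent input size can be illustrated with a minimal sketch. This is not the authors' rendering pipeline, which produces plot images for a VLM. It is a stand-in that rasterizes a series of any length onto a fixed grid, under the assumption that each image column summarizes its share of samples by their min-max range so that short spikes survive downscaling. The names `rasterize_series`, `width`, and `height` are hypothetical.

```python
from typing import List, Sequence


def rasterize_series(values: Sequence[float], width: int = 64,
                     height: int = 16) -> List[List[int]]:
    """Draw a series of any length onto a fixed height x width binary grid.

    Hypothetical helper for illustration only; not the paper's renderer.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat series

    def to_row(v: float) -> int:
        # Row 0 is the top of the image, so larger values map to smaller rows.
        return (height - 1) - int(round((v - lo) / span * (height - 1)))

    grid = [[0] * width for _ in range(height)]
    for col in range(width):
        # Proportional mapping: each column covers an equal share of samples.
        start = col * n // width
        end = max(start + 1, (col + 1) * n // width)
        chunk = values[start:end]
        # Fill the column between the chunk's max and min so spikes stay visible.
        top, bottom = to_row(max(chunk)), to_row(min(chunk))
        for row in range(top, bottom + 1):
            grid[row][col] = 1
    return grid


# One hour and one week of 1-minute samples yield grids of identical shape.
one_hour = [float(i % 7) for i in range(60)]
one_week = [0.0] * 10080
one_week[5000] = 1.0  # a single-point spike
hour_grid = rasterize_series(one_hour)
week_grid = rasterize_series(one_week)
```

Both grids have the same dimensions regardless of input length, which is the property that lets a model with a fixed image input accept sequences from one hour to one week. Aggregating by min-max per column is one design choice among several; plain subsampling would be simpler but could drop a one-sample spike entirely.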

Paper Structure

This paper contains 42 sections, 13 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of TS-LLM, PTSE-LLM, and TS-VLM.
  • Figure 2: Performance across different image types. LINE refers to a single line plot. LINE_STFT denotes a composite image containing two subplots—one showing the line plot and the other displaying the STFT. LINE_STFT_2 indicates two separate input images: one for the line plot and another for the STFT.
  • Figure 3: Overview of ViTs.
  • Figure 4: Illustration of frequency anomalies.
  • Figure 5: Reward during RL stage.
  • ...and 6 more figures