Can LLMs Understand Time Series Anomalies?

Zihao Zhou, Rose Yu

TL;DR

The paper evaluates whether contemporary LLMs truly understand time series anomalies, testing zero-shot and few-shot capabilities across textual and visual representations within a hypothesis-driven framework. Multimodal LLMs (M-LLMs) detect anomalies better when the series is presented as an image rather than as text, while chain-of-thought prompting does not reliably improve performance, and neither repetition biases nor arithmetic abilities account for the models' behavior. Performance varies substantially across models, and LLMs detect only trivial anomalies well; more subtle real-world irregularities remain challenging. These findings call for rigorous, controlled evaluations and suggest multimodal preprocessing and careful model selection for practical time series anomaly detection.

Abstract

Large Language Models (LLMs) have gained popularity in time series forecasting, but their potential for anomaly detection remains largely unexplored. Our study investigates whether LLMs can understand and detect anomalies in time series data, focusing on zero-shot and few-shot scenarios. Inspired by conjectures about LLMs' behavior from time series forecasting research, we formulate key hypotheses about LLMs' capabilities in time series anomaly detection. We design and conduct principled experiments to test each of these hypotheses. Our investigation reveals several surprising findings about LLMs for time series: (1) LLMs understand time series better as images than as text. (2) LLMs do not demonstrate enhanced performance when prompted to engage in explicit reasoning about time series analysis. (3) Contrary to common beliefs, LLMs' understanding of time series does not stem from their repetition biases or arithmetic abilities. (4) LLMs' behaviors and performance in time series analysis vary significantly across different models. This study provides the first comprehensive analysis of contemporary LLM capabilities in time series anomaly detection. Our results suggest that while LLMs can understand trivial time series anomalies, we have no evidence that they can understand more subtle real-world anomalies. Many common conjectures based on their reasoning capabilities do not hold. All synthetic dataset generators, final prompts, and evaluation scripts are available at https://github.com/rose-stl-lab/anomllm.

Paper Structure

This paper contains 62 sections, 10 equations, 21 figures, 1 table, 1 algorithm.

Figures (21)

  • Figure 1: Example time series with different anomaly types, with anomalous regions highlighted in red.
  • Figure 1: Variants and their corresponding name codes; see Appendix \ref{app:variants} for details.
  • Figure 2: Example anomaly detection results for out-of-range anomalies. Direct thresholding with expert knowledge yields the best result, but the LLMs can also detect the approximate ranges without priors. Isolation Forest raises many false positives but still achieves a higher F1 than the LLMs, which motivates the use of affinity F1.
  • Figure 3: Reflexive (prompt that induces reasoning) / Reflective (prompt asks for direct answer), top 3 Affi-F1 prompt variants per mode. See Table \ref{tab:variants} for variant name codes.
  • Figure 4: Clean (original time series) / Noisy (time series with minimal injected noise), Top 3 Affi-F1 variants per noise level
  • ...and 16 more figures
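
The Figure 2 caption notes that point-wise F1 can rank a detector with many false positives above one that finds the approximate anomalous range, which is the stated motivation for affinity F1. The following is a minimal sketch of that effect, not the paper's evaluation code: the synthetic series, the threshold bounds, the shifted prediction, and all function names are assumptions made for illustration, and affinity F1 itself is not implemented here.

```python
import math

def make_series(n=200, anomaly=(80, 90), spike=3.0):
    """Sine wave with a hypothetical out-of-range segment (illustrative data)."""
    values = [math.sin(2 * math.pi * t / 50) for t in range(n)]
    labels = [0] * n
    for t in range(*anomaly):
        values[t] += spike   # push the segment outside the normal range
        labels[t] = 1
    return values, labels

def threshold_detect(values, low=-1.0, high=1.0):
    """Direct thresholding: flag points outside an assumed normal range."""
    return [1 if (v < low or v > high) else 0 for v in values]

def pointwise_f1(labels, preds):
    """Standard point-wise F1 over binary per-timestep labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

values, labels = make_series()

# Thresholding with known bounds recovers the anomalous segment exactly.
exact = threshold_detect(values)

# A hypothetical "approximate range" answer: same length, shifted by 5 steps.
shifted = [0] * len(values)
for t in range(85, 95):
    shifted[t] = 1

f1_exact = pointwise_f1(labels, exact)
f1_shifted = pointwise_f1(labels, shifted)
print(f"exact threshold F1: {f1_exact:.2f}, shifted range F1: {f1_shifted:.2f}")
```

In this toy setup the shifted prediction overlaps half of the true segment, so its point-wise F1 drops to 0.5 even though it identifies the right neighbourhood; a metric that tolerates small temporal offsets would penalise it less.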