Can LLMs Serve As Time Series Anomaly Detectors?

Manqing Dong, Hao Huang, Longbing Cao

TL;DR

This work investigates whether large language models can serve as explainable detectors for time series anomalies. It first shows that applying LLMs directly to anomaly detection is ineffective, but that prompt engineering enables GPT-4 to perform competitively with baseline methods and to provide explanations, while LLaMA3 lags behind. To boost performance, the authors introduce TTGenerator, which synthesizes labeled time series with anomalies and textual explanations, enabling instruction fine-tuning of LLaMA3 via LoRA. The instruction-tuned LLaMA3 improves in several anomaly categories, particularly seasonality, while GPT-4 remains strong on shorter sequences but hallucinates in longer contexts. Overall, the paper highlights the potential of LLMs as time series anomaly detectors when combined with targeted prompts and synthetic, explainable data, and outlines limitations and future directions concerning fine-tuning access and modality representations.

Abstract

An emerging topic in large language models (LLMs) is their application to time series forecasting, characterizing mainstream and patternable characteristics of time series. A relevant but rarely explored and more challenging question is whether LLMs can detect and explain time series anomalies, a critical task across various real-world applications. In this paper, we investigate the capabilities of LLMs, specifically GPT-4 and LLaMA3, in detecting and explaining anomalies in time series. Our studies reveal that: 1) LLMs cannot be directly used for time series anomaly detection. 2) By designing prompt strategies such as in-context learning and chain-of-thought prompting, GPT-4 can detect time series anomalies with results competitive to baseline methods. 3) We propose a synthesized dataset to automatically generate time series anomalies with corresponding explanations. By applying instruction fine-tuning on this dataset, LLaMA3 demonstrates improved performance in time series anomaly detection tasks. In summary, our exploration shows the promising potential of LLMs as time series anomaly detectors.
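
To make the idea of a synthesized dataset with explanations concrete, the following is a minimal sketch of how one labeled sample might be generated and formatted for instruction fine-tuning. It is not the paper's generator: the function names, the sine-plus-noise base signal, the spike size, and the record schema (`instruction`, `input`, `output`) are all assumptions chosen for illustration, and only a single global point anomaly is covered.

```python
import math
import random


def make_series_with_point_anomaly(length=48, period=12, seed=0):
    """Return (values, anomaly_index, explanation) for one synthetic sample.

    Hypothetical helper: the signal shape and spike size are assumptions.
    """
    rng = random.Random(seed)
    # Base signal: sine-wave seasonality plus small Gaussian noise.
    values = [
        math.sin(2 * math.pi * t / period) + rng.gauss(0, 0.05)
        for t in range(length)
    ]
    # Inject one global point anomaly: a spike far outside the normal range.
    idx = rng.randrange(length)
    values[idx] += 5.0
    explanation = (
        f"The value at index {idx} is {values[idx]:.2f}, far above the "
        f"usual range of roughly -1 to 1, so it is a global point anomaly."
    )
    return values, idx, explanation


def to_instruction_record(values, idx, explanation):
    """Format one sample as an instruction-tuning record (assumed schema)."""
    series_text = ", ".join(f"{v:.2f}" for v in values)
    return {
        "instruction": (
            "Identify the indices of any anomalies in the time series "
            "and explain why."
        ),
        "input": series_text,
        "output": f"Anomalous indices: [{idx}]. {explanation}",
    }


values, idx, explanation = make_series_with_point_anomaly()
record = to_instruction_record(values, idx, explanation)
print(record["output"])
```

Because the label and the explanation are produced by the same code that injects the anomaly, each record is correct by construction, which is what makes synthetic data attractive for this kind of fine-tuning.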

Paper Structure

This paper contains 29 sections, 3 equations, 22 figures, and 14 tables.

Figures (22)

  • Figure 1: Example responses from LLaMA3 and GPT-4 to time series with shape anomalies under direct and multi-modal instructions. The bottom panel shows the overall performance across all trial examples for different anomaly types (global point, local point, seasonal, trend, and shape anomalies) and prompting strategies (direct use of LLMs, multi-modal instruction, in-context learning, and chain-of-thought). For each anomaly type and prompting strategy, we conduct five trials and evaluate the correctness of both the identified indices and the explanations. A correctness rate of 100% means the model provided correct results in all five trials.
  • Figure 2: Templates for different prompt strategies, where the 'requirements' specify the tasks for the LLMs, e.g., providing the indices of the anomalies and explaining the reason if anomalies are detected, with examples in Figure \ref{fig:example_response}. More details can be found in Appendix \ref{app:prompt_settings}.
  • Figure 3: Examples of (a) good, (b) bad, and (c) hallucinated explanations by GPT-4 on the IOPS dataset.
  • Figure 4: Examples of (a) good, (b) bad, and (c) hallucinated explanations by LLaMA3.
  • Figure 5: Full Instruction Prompt for Each Strategy. For in-context learning and chain-of-thought learning, either the general instruction or multi-modal instruction is added to the beginning of the prompt to guide LLMs in performing the anomaly detection task.
  • ...and 17 more figures