Anomaly Detection of Tabular Data Using LLMs

Aodong Li, Yunhan Zhao, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, Stephan Mandt

TL;DR

This paper shows that pre-trained LLMs are zero-shot batch-level anomaly detectors: without any extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating an ability to identify low-density data regions.

Abstract

Large language models (LLMs) have shown their potential in long-context understanding and mathematical reasoning. In this paper, we study the problem of using LLMs to detect tabular anomalies and show that pre-trained LLMs are zero-shot batch-level anomaly detectors. That is, without extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating their ability to identify low-density data regions. For LLMs that are not well aligned with anomaly detection and frequently output factual errors, we apply simple yet effective data-generating processes to simulate synthetic batch-level anomaly detection datasets and propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies. Experiments on a large anomaly detection benchmark (ODDS) showcase i) that GPT-4 has on-par performance with state-of-the-art transductive learning-based anomaly detection methods and ii) the efficacy of our synthetic dataset and fine-tuning strategy in aligning LLMs to this task.
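
To make the zero-shot setup concrete (see Figure 1 below), here is a minimal Python sketch of batch-level anomaly detection with an LLM. The serialization format and the `query_llm` wrapper are illustrative assumptions, not the paper's exact prompt or API; any chat-style LLM (e.g., GPT-4) could back `query_llm`.

```python
import re
from typing import Callable, List, Sequence

def serialize_batch(batch: Sequence[Sequence[float]]) -> str:
    """Serialize a batch of tabular rows into an index-prefixed text listing."""
    return "\n".join(
        f"{i}: " + " ".join(f"{v:.4f}" for v in row)
        for i, row in enumerate(batch)
    )

def detect_anomalies(batch, query_llm: Callable[[str, str], str]) -> List[int]:
    """Ask an LLM which rows of the batch are anomalous and parse its answer.

    `query_llm(system, user)` is a hypothetical wrapper around any chat-style
    LLM API; it returns the model's raw text response.
    """
    system = "Only answer data indices."  # regularizes the response format
    user = (
        "Below is a batch of tabular data, one row per line, prefixed by its "
        "index. Which rows are anomalies?\n\n" + serialize_batch(batch)
    )
    response = query_llm(system, user)
    # Keep every integer in the response that is a valid row index.
    return sorted({int(m) for m in re.findall(r"\d+", response)
                   if int(m) < len(batch)})
```

Constraining the response via the system message keeps parsing trivial; the same serialization can be reused for the fine-tuning examples sketched later.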

Paper Structure

This paper contains 29 sections, 5 figures, 1 table, and 3 algorithms.

Figures (5)

  • Figure 1: Illustration of batch-level anomaly detection with LLMs. We serialize the data batch into text and apply our proposed prompts as input to the LLM. The LLM then responds with the indices of the abnormal data points based on its internal knowledge (cf. the prompting sketch above). The system message "Only answer data indices" regularizes the LLM's responses and ensures they are easy to parse.
  • Figure 2: Illustration of Llama2 for batch-level anomaly detection before and after our fine-tuning strategy. With the same input prompt, Llama2-70b (the 70-billion-parameter version) makes factual mistakes: two false negatives (missing 5 and 10) and one false positive (incorrectly flagging 14). These results were obtained from https://www.llama2.ai. In contrast, our fine-tuned 7-billion-parameter Llama2-AD (10x smaller than Llama2-70b) succeeds in discovering all anomalies.
  • Figure 3: Graphical models of the synthetic data-generating processes. (Left) We use a binary Gaussian mixture (i.e., $K=2$) to generate a batch of continuous data of size $N$: one Gaussian corresponds to normal data and the other to abnormal data. (Right) A multinomial mixture model ($K=2$) for discrete data, where one multinomial is for normal data and one for abnormal data. $\pi$ controls the anomaly ratio. Specifics of the random variables in the models are given in Appendix \ref{app:syndata-eg}. A sampling sketch for both processes follows this list.
  • Figure 4: LLMs can detect low-density regions in a contaminated data distribution. We use Mistral-AD, fine-tuned from Mistral, as the demonstrating LLM. The normal data distribution is a mixture of two Gaussians located at -25 and 25. The contaminated data distribution $p(x)$ (blue) combines the normal distribution with a wide uniform distribution spanning the interval $[-100, 100]$ at a contamination ratio of 0.1. We sample 500 independent batches from $p(x)$ and ask the LLM to predict the anomalies in each batch using our proposed method. We collect all predicted anomalies and estimate their density with a kernel density estimator, shown as $\hat{p}_a(x)$ (orange). $\hat{p}_a(x)$ successfully captures the three low-density regions of $p(x)$, demonstrating the LLM's ability to detect anomalies. More details are in Appendix \ref{app:exp-detail}. An outline of this experiment in code follows this list.
  • Figure 5: Examples of the synthetic data used for fine-tuning; a sketch of constructing such records follows this list.
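
Below is a minimal sketch of the Figure 3 data-generating processes. The specific parameter values (means, scales, and category probabilities) are illustrative assumptions, since the paper defers the exact settings to its appendix.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_continuous_batch(n=20, dim=2, pi=0.1):
    """Binary Gaussian mixture (K=2): one component for normal data,
    one for anomalies; `pi` is the anomaly ratio."""
    labels = rng.random(n) < pi                       # z_i ~ Bernoulli(pi)
    mu_normal = np.zeros(dim)                         # assumed component means
    mu_abnormal = 4.0 * np.ones(dim)
    x = np.where(labels[:, None],
                 rng.normal(mu_abnormal, 1.0, (n, dim)),
                 rng.normal(mu_normal, 1.0, (n, dim)))
    return x, labels

def sample_discrete_batch(n=20, vocab=5, pi=0.1):
    """Multinomial mixture (K=2) for discrete data: one distribution for
    normal data, one for anomalies."""
    labels = rng.random(n) < pi
    p_normal = np.full(vocab, 1.0 / vocab)            # assumed distributions
    p_abnormal = np.r_[0.8, np.full(vocab - 1, 0.2 / (vocab - 1))]
    x = np.array([rng.choice(vocab, p=(p_abnormal if a else p_normal))
                  for a in labels])
    return x, labels
```

Because the ground-truth labels are known by construction, batches sampled this way can directly supervise end-to-end fine-tuning.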
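
The Figure 4 experiment can be outlined as follows, reusing the hypothetical `detect_anomalies`/`query_llm` helpers from the sketch after the abstract; the batch size is an assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def sample_contaminated_batch(n=30, contamination=0.1):
    """p(x): two normal-data Gaussians at -25 and 25, contaminated by
    a wide uniform distribution over [-100, 100]."""
    is_outlier = rng.random(n) < contamination
    normal = rng.normal(np.where(rng.random(n) < 0.5, -25.0, 25.0), 1.0)
    return np.where(is_outlier, rng.uniform(-100.0, 100.0, n), normal)

# Collect LLM-predicted anomalies over 500 independent batches, then estimate
# their density with a kernel density estimator.
predicted = []
for _ in range(500):
    batch = sample_contaminated_batch()
    # `detect_anomalies` / `query_llm`: hypothetical helpers sketched earlier.
    idx = detect_anomalies(batch.reshape(-1, 1), query_llm)
    predicted.extend(batch[idx])
p_a_hat = gaussian_kde(predicted)  # evaluate on a grid to plot the orange curve
```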
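
Figure 5's fine-tuning examples pair a serialized synthetic batch with its ground-truth anomaly indices as the target. Here is a sketch of constructing one such record; the JSON field names and prompt wording are assumptions, not the paper's exact format.

```python
import json
import numpy as np

rng = np.random.default_rng(0)

def make_finetuning_example(batch: np.ndarray, anomaly_idx) -> dict:
    """One supervised record: prompt = serialized batch + question,
    completion = the ground-truth anomaly indices (hypothetical schema)."""
    rows = "\n".join(
        f"{i}: " + " ".join(f"{v:.4f}" for v in row)
        for i, row in enumerate(batch)
    )
    return {
        "system": "Only answer data indices.",
        "prompt": f"Which rows are anomalies?\n\n{rows}",
        "completion": " ".join(str(i) for i in sorted(anomaly_idx)),
    }

# Pair a synthetic batch (e.g., from the mixture samplers above) with its
# known labels and emit a JSONL line for end-to-end fine-tuning.
x = np.vstack([rng.normal(0.0, 1.0, (18, 2)), rng.normal(4.0, 1.0, (2, 2))])
print(json.dumps(make_finetuning_example(x, [18, 19])))
```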