LLM-Powered Text-Attributed Graph Anomaly Detection via Retrieval-Augmented Reasoning

Haoyan Xu, Ruizhi Qian, Zhengtao Yao, Ziyi Liu, Li Li, Yuqi Li, Yanshu Li, Wenqing Zheng, Daniele Rosa, Daniel Barcklow, Senthil Kumar, Jieyu Zhao, Yue Zhao

TL;DR

TAG-AD presents the first comprehensive benchmark for anomaly detection on text-attributed graphs (TAGs). It uses LLMs to generate realistic contextual anomalies directly in raw text and also incorporates additional contextual, text-perturbation, and structural anomalies. For zero-shot graph anomaly detection (GAD), the framework uses retrieval-augmented generation (RAG) to build a global anomaly knowledge base and distill it into a reusable analysis framework, reducing manual prompt engineering. Experiments compare unsupervised GNN-based detectors and zero-shot LLMs across four TAG datasets: LLMs excel at contextual anomalies, GNNs excel at structural anomalies, and RAG-assisted prompting performs comparably to human-designed prompts. The work provides datasets, code, and pipelines to foster integration of graph learning and foundation models for robust, scalable anomaly detection.

Abstract

Anomaly detection on attributed graphs plays an essential role in applications such as fraud detection, intrusion monitoring, and misinformation analysis. However, text-attributed graphs (TAGs), in which node information is expressed in natural language, remain underexplored, largely due to the absence of standardized benchmark datasets. In this work, we introduce TAG-AD, a comprehensive benchmark for anomaly node detection on TAGs. TAG-AD leverages large language models (LLMs) to generate realistic anomalous node texts directly in the raw text space, producing anomalies that are semantically coherent yet contextually inconsistent and thus more reflective of real-world irregularities. In addition, TAG-AD incorporates multiple other anomaly types, enabling thorough and reproducible evaluation of graph anomaly detection (GAD) methods. With these datasets, we further benchmark existing unsupervised GNN-based GAD methods as well as zero-shot LLMs for GAD. As part of our zero-shot detection setup, we propose a retrieval-augmented generation (RAG)-assisted, LLM-based zero-shot anomaly detection framework. The framework mitigates reliance on brittle, hand-crafted prompts by constructing a global anomaly knowledge base and distilling it into reusable analysis frameworks. Our experimental results reveal a clear division of strengths: LLMs are particularly effective at detecting contextual anomalies, whereas GNN-based methods remain superior for structural anomaly detection. Moreover, RAG-assisted prompting achieves performance comparable to human-designed prompts while eliminating manual prompt engineering, underscoring the practical value of our RAG-assisted zero-shot LLM anomaly detection framework.

Paper Structure

This paper contains 46 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Our method introduces a RAG framework for zero-shot GAD, in which a globally retrieved anomaly knowledge base is distilled by an LLM into unified detection guidelines. During inference, each prompt integrates this analysis framework, the task description, and node-specific graph context to enable consistent and interpretable anomaly reasoning.
  • Figure 2: Performance comparison of zero-shot LLMs on the Cora dataset using Plain Prompt, RAG Prompt, and Manual Prompt. Plain Prompt: constructed without an explicit analysis framework. RAG Prompt: incorporates a retrieval-based analysis framework. Manual Prompt: uses a human-designed analysis framework.
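
The prompt composition described in the two captions above can be sketched as follows. This is a minimal illustration and not the authors' code: the `NodeContext` container, the `build_prompt` helper, the section labels, and the wording of `TASK_DESCRIPTION` are all hypothetical names and text chosen for the example. The only parts taken from the captions are the three prompt ingredients (analysis framework, task description, node-specific graph context) and the fact that the Plain Prompt omits the analysis framework, while the RAG and Manual Prompts differ only in where that framework comes from.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeContext:
    """Hypothetical container: one node's text plus its neighbors' texts."""
    node_text: str
    neighbor_texts: List[str] = field(default_factory=list)


# Assumed wording; the paper's actual task description is not given here.
TASK_DESCRIPTION = (
    "Decide whether the target node is anomalous with respect to its "
    "neighbors. Answer 'anomalous' or 'normal' and give a brief reason."
)


def build_prompt(ctx: NodeContext, analysis_framework: Optional[str] = None) -> str:
    """Assemble a zero-shot detection prompt for a single node.

    analysis_framework=None        -> Plain Prompt (no explicit framework)
    LLM-distilled framework text   -> RAG Prompt
    human-written framework text   -> Manual Prompt
    """
    parts = []
    if analysis_framework:
        parts.append("Analysis framework:\n" + analysis_framework.strip())
    parts.append("Task:\n" + TASK_DESCRIPTION)
    parts.append("Target node text:\n" + ctx.node_text)
    neighbors = "\n".join(f"- {t}" for t in ctx.neighbor_texts) or "- (no neighbors)"
    parts.append("Neighbor texts:\n" + neighbors)
    return "\n\n".join(parts)


# Usage: the same node rendered as a Plain Prompt and as a framework-based prompt.
example = NodeContext(
    node_text="A survey of graph neural networks for node classification.",
    neighbor_texts=["Spectral methods on graphs.", "Message passing networks."],
)
plain_prompt = build_prompt(example)
framework_prompt = build_prompt(
    example, "1. Compare the node's topic with the topics of its neighbors."
)
```

Keeping the analysis framework as a single optional string means the three prompt variants compared in Figure 2 share every other part of the prompt, so a difference in detection quality can be attributed to the framework text alone.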