RAID: Retrieval-Augmented Anomaly Detection

Mingxiu Cai, Zhe Zhang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

TL;DR

This work reframes unsupervised anomaly detection (UAD) as a retrieval-augmented problem. It introduces RAID, which uses hierarchical class-, semantic-, and instance-level templates together with a guided Mixture-of-Experts (MoE) generator to suppress matching noise. By combining a coarse-to-fine retrieval pipeline with a two-stage filtering mechanism, RAID achieves robust pixel-level anomaly localization and generalizes across full-shot, few-shot, and multi-dataset scenarios on MVTec-AD, VisA, MPDD, and BTAD. The approach reports state-of-the-art performance, efficiency gains from hierarchical retrieval, and broad applicability, including integration with reconstruction-based methods. Overall, RAID advances UAD by pairing principled retrieval with noise-aware generation, enabling scalable, data-efficient industrial anomaly detection with precise localization.

Abstract

Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks. Code: https://github.com/Mingxiu-Cai/RAID.
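
The coarse-to-fine retrieval and matching cost volume described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the database layout (`class_proto`, `semantic_protos`, `instance_tokens`), the function name `coarse_to_fine_cost`, the use of cosine similarity, and the `top_k` parameter are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def coarse_to_fine_cost(query_cls, query_tokens, db, top_k=1):
    """Hypothetical three-level retrieval followed by a matching cost.

    query_cls:    (D,)   global (CLS-like) feature of the test image
    query_tokens: (N, D) patch tokens of the test image
    db: list of per-class dicts (assumed layout) with
        'class_proto'     (D,)
        'semantic_protos' (K, D)
        'instance_tokens' list of K arrays, each (M_k, D) with M_k >= top_k
    Returns the chosen class index, the per-token semantic index (N,),
    and a cost array (N, top_k) holding 1 - cosine similarity to the
    top_k closest normal instance tokens.
    """
    q_cls = l2_normalize(query_cls)
    q = l2_normalize(query_tokens)

    # Level 1 (class): pick the class whose prototype best matches the image.
    class_protos = l2_normalize(np.stack([e["class_proto"] for e in db]))
    class_idx = int(np.argmax(class_protos @ q_cls))
    entry = db[class_idx]

    # Level 2 (semantic): route each token to its nearest semantic prototype.
    sem = l2_normalize(entry["semantic_protos"])
    sem_idx = np.argmax(q @ sem.T, axis=1)

    # Level 3 (instance): compare only against tokens stored under that prototype.
    cost = np.empty((len(q), top_k))
    for k in np.unique(sem_idx):
        mask = sem_idx == k
        inst = l2_normalize(entry["instance_tokens"][k])
        sim = q[mask] @ inst.T                      # (n_k, M_k) cosine similarities
        cost[mask] = np.sort(1.0 - sim, axis=1)[:, :top_k]
    return class_idx, sem_idx, cost
```

In this sketch each token is compared only with the instance tokens under one semantic prototype of one class, which is where a hierarchical index saves work compared with searching every stored token. The resulting costs are still noisy, which is the problem the guided MoE network is meant to address.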

Paper Structure (27 sections, 15 equations, 12 figures, 24 tables)

Figures (12)

  • Figure 1: We reformulate UAD within the Retrieval-Augmented Generation (RAG) paradigm, effectively reducing retrieval and matching noise and enabling stronger generalization across full-shot, few-shot, and multi-dataset settings.
  • Figure 2: Overview of our RAID, which reinterprets UAD within a RAG paradigm. (a) In the retrieval stage, a hierarchical vector database is constructed, indexing tokenized templates into three sequential entity levels: class prototype, semantic prototype, and instance token. This structure allows efficient retrieval flow queried by input tokens. (b) In the generation stage, an anomaly cost volume is built by matching each input token with its retrieved template tokens. A guided MoE filter then dynamically refines this cost volume under the dual guidance of the retrieved semantic prototypes and the input tokens.
  • Figure 3: Qualitative comparison of multi-class anomaly localization results on MVTec-AD and VisA datasets.
  • Figure 4: Detailed architecture of the guided MoE filter.
  • Figure 5: t-SNE visualizations of CLS tokens on MVTec, VisA, MPDD, BTAD, and the multi-dataset setting. The distinct, compact clusters show that CLS tokens encode strongly discriminative class-level semantics.
  • ...and 7 more figures
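
The dual guidance described in the Figure 2(b) caption, where retrieved semantic prototypes and input tokens steer the refinement of the cost volume, can be sketched as a softly gated mixture of experts. The architecture in Figure 4 is not reproduced here. The class name `GuidedMoEFilter`, the linear experts, the concatenation-based gate, and the random untrained weights are all assumptions, so the scores are only meaningful after training.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class GuidedMoEFilter:
    """Hypothetical guided MoE: the gate sees the guidance, the experts see the cost."""

    def __init__(self, token_dim, cost_dim, num_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        # Gate weights map [input token ; retrieved prototype] to expert logits.
        self.w_gate = rng.normal(0.0, 0.02, size=(2 * token_dim, num_experts))
        # Each expert is a single linear map from a cost vector to a scalar score.
        self.w_experts = rng.normal(0.0, 0.02, size=(num_experts, cost_dim))
        self.b_experts = np.zeros(num_experts)

    def __call__(self, cost, tokens, protos):
        """cost: (N, C) matching costs; tokens, protos: (N, D) guidance features."""
        guide = np.concatenate([tokens, protos], axis=1)        # (N, 2D)
        gates = softmax(guide @ self.w_gate, axis=1)            # (N, E), rows sum to 1
        expert_out = cost @ self.w_experts.T + self.b_experts   # (N, E)
        score = (gates * expert_out).sum(axis=1)                # (N,) refined score
        return score, gates
```

The design point this sketch tries to capture is that the gate never sees the cost values: expert selection depends only on what the token looks like and which normal prototype it was matched to, so the filtering can adapt to local content.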