Unified Unsupervised Anomaly Detection via Matching Cost Filtering

Zhe Zhang, Mingxiu Cai, Gaochang Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu

TL;DR

Unified Cost Filtering (UCF) reframes unsupervised anomaly detection (UAD), in both unimodal RGB and multimodal RGB–3D and RGB–Text settings, as a three-stage pipeline: feature extraction, anomaly cost-volume construction, and cost-volume filtering. UCF builds a multi-template similarity volume by matching within or across modalities and refines it with a dual-stream, attention-guided 3D U‑Net (RCSA), suppressing the matching noise that arises from reconstruction shortcuts and cross-modal misalignment. Integrated as a plug-in into 10 diverse baselines, UCF achieves state-of-the-art results across 22 benchmarks covering RGB, RGB–3D, and RGB–Text UAD tasks, with modest memory and compute overhead. It improves both detection (image-level) and localization (pixel/region-level) performance, which supports practical deployment in industrial and medical domains and broader cross-modal knowledge transfer in anomaly detection.

Abstract

Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond the earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines of work remain largely isolated, hindering comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings from a matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc framework that refines the anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, and is then processed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
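
To make the matching step concrete, the following is a minimal sketch of how an anomaly cost volume could be built from patch features. It is not the authors' implementation: the function names (`cosine`, `build_cost_volume`, `initial_anomaly_map`), the use of cosine distance as the matching cost, and the minimum-over-templates readout are all assumptions made for illustration. The learnable filtering module is omitted entirely.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def build_cost_volume(test_feats, templates):
    """Hypothetical helper: cost[t][i][j] = 1 - cos(test patch i, patch j of template t)."""
    return [
        [[1.0 - cosine(f, g) for g in template] for f in test_feats]
        for template in templates
    ]

def initial_anomaly_map(cost_volume):
    """Assumed readout: per test patch, the lowest cost over all templates and patches."""
    n_patches = len(cost_volume[0])
    return [
        min(min(cost_t[i]) for cost_t in cost_volume)
        for i in range(n_patches)
    ]

# Toy data (made up): 3 test patches, 2 normal templates of 2 patches each.
test_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [1.0, 1.0]],
]
volume = build_cost_volume(test_feats, templates)
scores = initial_anomaly_map(volume)
print(scores)  # the third patch has no close normal match, so its score is highest
```

In this toy setting the first two test patches find an exact match among the normal templates, while the third does not. A raw readout like this is where matching noise would show up in practice, which is the quantity UCF's filtering network is described as refining.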

Paper Structure

This paper contains 85 sections, 11 equations, 39 figures, 58 tables.

Figures (39)

  • Figure 1: We advocate a unified UAD perspective and introduce UCF, a generic matching-cost filtering method that plugs seamlessly into unimodal RGB (GLAD), RGB–Text (AprilGAN), and RGB–3D (M3DM) scenarios. For each scenario, we present anomaly heatmaps and kernel density estimates (KDE) of detection logits. Baselines are shown in yellow and ours (+UCF) in green. UCF suppresses matching noise, reduces false positives and negatives, sharpens separability between anomalies and normals, and consistently improves performance.
  • Figure 2: Overview of our UCF, a generic plug-in for UAD. We reformulate UAD as a matching cost filtering process applicable to both unimodal (RGB) and multimodal (RGB–3D, RGB–Text) scenarios. (i) First, we employ baseline pre-trained encoders to extract features from the input and reference templates, which may be reconstructed normal samples, randomly sampled normal templates, or cross-modal counterparts. (ii) Second, we construct an anomaly cost volume based on global similarity matching across or within modalities. (iii) Lastly, we learn a matching cost filtering network, guided by attention queries derived from the input features and an initial anomaly map, to refine the volume and generate the final detection results. (iv) Further, we integrate a class-aware adaptor to tackle class imbalance and enhance the ability to deal with multiple anomaly classes simultaneously.
  • Figure 3: Qualitative results of unimodal RGB UAD. We present a comparison of multi-class anomaly localization between our method and GLAD (G), HVQ-Trans (H), and AnomalDF (A) on MVTec-AD (top 3 rows) and VisA (bottom 3 rows). By integrating with existing works, our method mitigates matching noise (e.g., false negatives in PCB2, false positives in Pill, and blurred boundaries in Carpet), thus improving anomaly localization.
  • Figure 4: Qualitative results of multimodal RGB–3D UAD. We compare our method against M3DM and CFM on Eyecandies (left column) and MVTec 3D-AD (right column) for unsupervised anomaly localization. Our approach improves multimodal anomaly detection, effectively reducing noise and enhancing the localization of anomalies across both datasets.
  • Figure 5: Qualitative results of multimodal RGB–Text UAD. We compare our method with AprilGAN, AdaCLIP, and AnomalyCLIP on representative categories from medical datasets (left column) and industrial datasets (right column). By integrating our cross-modal matching cost filtering with existing RGB–Text baselines, our method yields more precise and robust anomaly localization.
  • ...and 34 more figures