ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection

Ke Ma; Jun Long; Hongxiao Fei; Liujie Hua; Zhen Dai; Yueyi Luo

TL;DR

This work tackles zero-shot anomaly detection with pre-trained vision–language models by addressing two core limitations: rigid, static cross-modal fusion and the absence of the local inductive biases that dense prediction requires. The authors propose an Architectural Co-Design framework (ACD-CLIP) that combines a parameter-efficient Conv-LoRA adapter, which injects local priors into the vision encoder, with a Dynamic Fusion Gateway, which generates level-specific text descriptors for adaptive, context-aware fusion. Through hierarchical feature adaptation and dynamic text modeling, ACD-CLIP achieves state-of-the-art results on twelve industrial and medical benchmarks, demonstrating strong pixel-level localization and cross-domain generalization. The authors present this as a principled path for adapting foundation models to dense perception tasks, with possible extensions to related dense-prediction problems.

Abstract

Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. We propose a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter that injects local inductive biases for fine-grained representation, and a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
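
The abstract names Conv-LoRA but does not give its layer layout, so the following NumPy sketch is only one plausible reading, not the authors' implementation. It assumes a standard LoRA low-rank update (down-projection A, up-projection B) with a 3x3 depthwise convolution applied in the rank-r bottleneck, after the patch tokens are reshaped onto their H x W grid. All names (depthwise_conv3x3, conv_lora_delta, A, B, kernels) and all sizes are hypothetical and chosen for illustration.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Zero-padded 3x3 depthwise filter (cross-correlation, as in deep learning).

    x: (H, W, r) feature map; kernels: (3, 3, r), one 3x3 filter per channel.
    """
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the input, weighted per channel.
            out += padded[dy:dy + H, dx:dx + W, :] * kernels[dy, dx, :]
    return out


def conv_lora_delta(tokens, grid_hw, A, B, kernels, scale=1.0):
    """Low-rank update with a local convolution in the bottleneck (assumed design).

    tokens: (N, D) patch tokens with N = H * W; A: (D, r); B: (r, D).
    """
    H, W = grid_hw
    low = (tokens @ A).reshape(H, W, -1)   # project to rank r, restore the grid
    low = depthwise_conv3x3(low, kernels)  # mix each token with its neighbours
    return scale * (low.reshape(H * W, -1) @ B)


rng = np.random.default_rng(0)
H, W, D, r = 4, 4, 8, 2                    # illustrative sizes only
tokens = rng.normal(size=(H * W, D))       # stand-in for frozen encoder features
A = rng.normal(size=(D, r)) * 0.1
B = np.zeros((r, D))                       # zero init, the usual LoRA convention
kernels = rng.normal(size=(3, 3, r))

# The frozen features stay untouched; only the small adapter would be trained.
adapted = tokens + conv_lora_delta(tokens, (H, W), A, B, kernels)
```

With B initialized to zero the adapter starts as an identity mapping, so the pre-trained features are unchanged before any training. The trainable parameters in this sketch number 2 * D * r + 9 * r, which is where the parameter efficiency of a design like this would come from.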
