ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection

Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Zhen Dai, Yueyi Luo

TL;DR

This work tackles zero-shot anomaly detection with pre-trained vision–language models by addressing two core limitations: rigid, static cross-modal fusion and the absence of local inductive biases necessary for dense prediction. The authors propose Architectural Co-Design (ACD-CLIP), combining a parameter-efficient Conv-LoRA adapter to inject local priors into vision encoders with a Dynamic Fusion Gateway that generates level-specific text descriptors for adaptive, context-aware fusion. Through hierarchical feature adaptation and dynamic text modeling, ACD-CLIP achieves state-of-the-art results on twelve industrial and medical benchmarks, demonstrating strong pixel-level localization and cross-domain generalization. The study provides a principled path for effectively adapting foundation models to dense perception tasks and suggests extensions to related dense-prediction challenges.

Abstract

Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
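As a reading aid, the Conv-LoRA idea in the abstract can be written in standard LoRA notation. This is one plausible form under stated assumptions, not the paper's own equations: $W_0$ is a frozen projection, $A$ and $B$ are the rank-$r$ down- and up-projections, $\mathcal{K}$ is a hypothetical set of kernel sizes, and reshape/flatten convert between the patch-token sequence and its 2D grid. Whether the branches are summed or combined some other way is not stated here; summation is assumed for simplicity.

```latex
% Assumed notation; symbols A, B, W_0, \mathcal{K} are illustrative.
\begin{aligned}
z         &= \mathrm{reshape}(A x)
            && \text{project tokens to rank } r \text{ and restore the 2D patch grid} \\
\tilde{z} &= \sum_{k \in \mathcal{K}} \mathrm{Conv}_{k \times k}(z)
            && \text{multi-scale convolution branches inside the bottleneck} \\
h         &= W_0 x + B\,\mathrm{flatten}(\tilde{z})
            && \text{frozen path plus low-rank, locally aware update}
\end{aligned}
```

With $\mathcal{K} = \emptyset$ replaced by an identity map, this reduces to ordinary LoRA, $h = W_0 x + B A x$, which is why the convolutions can be read as the only source of added local inductive bias.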

Paper Structure

This paper contains 10 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Comparison of fusion paradigms. (a) Prior works rely on a rigid, static alignment between corresponding feature blocks. (b) Our Architectural Co-Design enables a flexible fusion policy by enriching visual features with local priors (Conv-LoRA) and then dynamically generating tailored text descriptors for each visual level (DFG).
  • Figure 2: Overview of the proposed ACD-CLIP architecture. (a) The Overall Framework: We structure CLIP's vision and text encoders into a hierarchy of $N$ sequential Groups (as illustrated, $N = 3$). Each vision group is enhanced by a trainable Conv-LoRA Adapter to instill local priors, while each corresponding text group incorporates a standard LoRA adapter. The Dynamic Fusion Gateway then uses each visual feature $V_i$ to generate a tailored text descriptor for producing a level-specific anomaly map. (b) The Conv-LoRA Adapter: Our parameter-efficient adapter features a multi-branch design with multi-scale convolutions inside a LoRA bottleneck.
  • Figure 3: Qualitative comparison on diverse industrial and medical datasets, showing our method's superior localization accuracy and noise suppression over state-of-the-art baselines.
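
The data flow described in the Figure 2 caption (a visual feature $V_i$ conditions a text descriptor, which then yields a level-specific anomaly map) can be illustrated with a minimal plain-Python sketch. Everything here is an assumption made for illustration: the function names, the similarity-based softmax gate, the normal/abnormal descriptor pair, the temperature of 0.07, and the averaging across levels are not taken from the paper or its repository.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Fall back to 1.0 so an all-zero vector is returned unchanged.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(patches):
    # Average patch embeddings into one visual context vector.
    return [sum(col) / len(patches) for col in zip(*patches)]


def gated_descriptor(visual_context, text_candidates):
    # Hypothetical gate: weight candidate text embeddings by their cosine
    # similarity to the visual context, then mix them into one descriptor.
    ctx = normalize(visual_context)
    weights = softmax([dot(ctx, normalize(t)) for t in text_candidates])
    dim = len(text_candidates[0])
    mixed = [sum(w * t[d] for w, t in zip(weights, text_candidates))
             for d in range(dim)]
    return normalize(mixed)


def level_anomaly_map(patches, normal_candidates, abnormal_candidates,
                      temperature=0.07):
    # One anomaly probability per patch for a single visual level V_i.
    ctx = mean_pool(patches)
    t_normal = gated_descriptor(ctx, normal_candidates)
    t_abnormal = gated_descriptor(ctx, abnormal_candidates)
    scores = []
    for patch in patches:
        p = normalize(patch)
        logits = [dot(p, t_normal) / temperature,
                  dot(p, t_abnormal) / temperature]
        scores.append(softmax(logits)[1])  # probability of "abnormal"
    return scores


def fuse_levels(level_maps):
    # Assumed fusion rule: average the per-level maps patch by patch.
    return [sum(vals) / len(vals) for vals in zip(*level_maps)]


if __name__ == "__main__":
    # Toy 2-D embeddings: the first patch resembles "normal", the second "abnormal".
    patches = [[1.0, 0.0], [0.0, 1.0]]
    level_map = level_anomaly_map(patches, [[1.0, 0.0]], [[0.0, 1.0]])
    print(fuse_levels([level_map, level_map]))
```

In a real model the patch and text embeddings would come from the adapted CLIP encoders, and the per-patch scores would be reshaped to the patch grid and upsampled to image resolution; the sketch omits both steps.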