Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang

TL;DR

The paper addresses a weakness of few-shot anomaly detection (FSAD) caused by coarse image-level prompts by introducing MFSC, which provides multi-level, fine-grained textual descriptions of normal images. Built on MFSC, FineGrainedAD employs Multi-Level Learnable Prompts (MLLP) and Multi-Level Semantic Alignment (MLSA) to align prompts with visual regions, enabling token-wise and component-level anomaly localization. Through language-guided region aggregation, dynamic prompt-region matching, and multi-level loss functions, the method achieves state-of-the-art pixel-level AUROC in 1-, 2-, and 4-shot settings on MVTec-AD and VisA, with competitive inference efficiency and strong generalization (e.g., RealIAD). This approach advances FSAD toward practical, fine-grained localization in real-world industrial and medical scenarios.

Abstract

Few-shot anomaly detection (FSAD) methods identify anomalous regions using only a few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, because detailed textual descriptions are lacking, these methods can only pre-define image-level descriptions and match them against each visual patch token to identify potentially anomalous regions. This causes semantic misalignment between image-level descriptions and patch-level visual anomalies and results in sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance. It consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA uses a region aggregation strategy and multi-level alignment training to help the learnable prompts better align with their corresponding visual regions. Experiments demonstrate that FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
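
To make the coarse-grained baseline described in the abstract concrete, the following is a minimal sketch of matching every visual patch token against a single pair of image-level text embeddings. It illustrates the general CLIP-style scoring idea only, not the FineGrainedAD method itself. The function names, the temperature value, and the toy 2-D vectors are all assumptions made for illustration; real systems would use embeddings from a pre-trained VLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(patch_feats, normal_text, anomalous_text, temperature=0.07):
    """Score each patch by a two-way softmax over its similarity to the
    'normal' and 'anomalous' text embeddings (hypothetical helper).

    The same image-level text pair is reused for every patch, which is the
    coarse-grained matching the abstract identifies as a source of misalignment.
    """
    scores = []
    for patch in patch_feats:
        s_normal = cosine(patch, normal_text) / temperature
        s_anom = cosine(patch, anomalous_text) / temperature
        # Subtract the max logit for numerical stability.
        m = max(s_normal, s_anom)
        e_normal = math.exp(s_normal - m)
        e_anom = math.exp(s_anom - m)
        scores.append(e_anom / (e_normal + e_anom))
    return scores


# Toy embeddings (assumed): one patch near each prompt, one equidistant.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
normal_text = [1.0, 0.0]
anomalous_text = [0.0, 1.0]
scores = patch_anomaly_scores(patches, normal_text, anomalous_text)
```

In this toy setup the first patch scores close to 0, the second close to 1, and the equidistant patch scores 0.5, which shows how a patch that resembles neither image-level description well receives an ambiguous score.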

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The perceptional visualization of different methods. Previous coarse-grained handcrafted (e.g., AnoVL) and learnable (e.g., PromptAD) image-level prompts lead to higher activation values in normal regions, suggesting semantic misalignment between image-level prompts and patch-level visual features.
  • Figure 2: Overview of FineGrainedAD, which includes two branches: the Vision-guided Anomaly Detection (VAD) branch and the Prompt-guided Anomaly Detection (PAD) branch. The VAD branch extracts visual features of few-shot normal images using CLIP, then compares them with the query image features to obtain the anomalous regions. The PAD branch first uses language-guided region aggregation to obtain the matching relationship between multi-level prompts and their corresponding visual regions, then optimizes the feature space of the multi-level learnable prompts through multi-level alignment training. During inference, it adopts a dynamic token-wise inference mechanism to assign an appropriate prompt to each visual patch, achieving accurate perception within each visual component.
  • Figure 3: The visualization of Multi-Level Fine-Grained Semantic Caption (MFSC) and Multi-level Prompts Construction through replacement and concatenation mechanism.
  • Figure 4: The visual example of automatically constructed Multi-Level Fine-Grained Semantic Caption (MFSC) on MVTec-AD and VisA datasets.
  • Figure 5: The qualitative comparisons of 4-shot anomaly localization methods on MVTec-AD and VisA.
  • ...and 2 more figures
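
The two branches described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the caption does not specify the comparison metric or the assignment rule, so the sketch assumes cosine similarity, a nearest-neighbor comparison against the few-shot normal patches for the VAD branch, and an argmax over prompt similarities for the token-wise prompt assignment. All function names, prompt labels, and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def vad_scores(query_patches, normal_patches):
    """VAD-style score (assumed form): one minus the best cosine match
    between a query patch and any patch from the few-shot normal images."""
    return [
        1.0 - max(cosine(q, n) for n in normal_patches)
        for q in query_patches
    ]


def assign_prompts(query_patches, prompt_feats):
    """Token-wise assignment (assumed form): give each patch the prompt
    whose embedding is most similar to it."""
    return [
        max(prompt_feats, key=lambda name: cosine(q, prompt_feats[name]))
        for q in query_patches
    ]


# Toy data (assumed): two normal reference patches, two query patches.
normal_bank = [[1.0, 0.0], [0.8, 0.6]]
query = [[1.0, 0.0], [0.0, 1.0]]
vad = vad_scores(query, normal_bank)

# Hypothetical component-level prompt embeddings.
prompts = {"component_a": [1.0, 0.0], "component_b": [0.0, 1.0]}
assigned = assign_prompts([[0.9, 0.1], [0.2, 0.8]], prompts)
```

Here the first query patch matches a stored normal patch exactly and scores near 0, while the second has a best match of 0.6 and scores about 0.4. Each patch in the second call is paired with its closest prompt, so a later scoring step could compare the patch only against the description of the component it belongs to.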