Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Yuanting Fan; Jun Liu; Xiaochen Chen; Bin-Bin Gao; Jian Li; Yong Liu; Jinlong Peng; Chengjie Wang

TL;DR

The paper addresses a weakness of few-shot anomaly detection (FSAD) methods that rely on coarse image-level prompts by introducing the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions of normal images. Built on MFSC, FineGrainedAD uses a Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA) to align prompts with visual regions, enabling token-wise and component-level anomaly localization. Through language-guided region aggregation, dynamic prompt-region matching, and multi-level loss functions, the method reports state-of-the-art pixel-level AUROC in 1-, 2-, and 4-shot settings on MVTec-AD and VisA, with competitive inference efficiency and strong generalization (e.g., to RealIAD). This approach advances FSAD toward practical, fine-grained localization in real-world industrial and medical scenarios.

Abstract

Few-shot anomaly detection (FSAD) methods identify anomalous regions using only a few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, because detailed textual descriptions are lacking, these methods can only pre-define image-level descriptions and match them against each visual patch token to identify potentially anomalous regions. This leads to semantic misalignment between image-level descriptions and patch-level visual anomalies, and yields sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance. It consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help the learnable prompts align better with their corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
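To make the baseline described in the abstract concrete, the following is a minimal sketch of how a generic VLM-based detector might score patches: each visual patch token is compared by cosine similarity against one "normal" and one "abnormal" image-level text embedding, and a softmax over the two similarities gives a per-patch anomaly probability. This is not the FineGrainedAD implementation. The function name `patch_anomaly_map`, the temperature value, and the synthetic embeddings standing in for real VLM features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def patch_anomaly_map(patch_tokens, text_normal, text_abnormal, grid_hw,
                      temperature=0.07):
    """Score each patch against two image-level text embeddings.

    patch_tokens:  (N, D) patch features (hypothetical VLM outputs).
    text_normal:   (D,) embedding of a "normal" description.
    text_abnormal: (D,) embedding of an "abnormal" description.
    grid_hw:       (H, W) with H * W == N.
    Returns an (H, W) map of abnormal-class probabilities.
    The temperature is an assumed value, not taken from the paper.
    """
    h, w = grid_hw
    assert patch_tokens.shape[0] == h * w, "grid does not match patch count"

    patches = l2_normalize(patch_tokens)                           # (N, D)
    texts = l2_normalize(np.stack([text_normal, text_abnormal]))   # (2, D)

    # Cosine similarity of every patch to both descriptions, then softmax.
    logits = patches @ texts.T / temperature                       # (N, 2)
    logits = logits - logits.max(axis=1, keepdims=True)            # stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    return probs[:, 1].reshape(h, w)


# Synthetic stand-ins for real embeddings (assumption: orthogonal text vectors).
rng = np.random.default_rng(0)
dim, grid = 16, (4, 4)
text_normal = np.eye(dim)[0]
text_abnormal = np.eye(dim)[1]

# Fifteen patches lie near the "normal" direction; patch 5 lies near "abnormal".
patch_tokens = text_normal + 0.05 * rng.normal(size=(grid[0] * grid[1], dim))
patch_tokens[5] = text_abnormal + 0.05 * rng.normal(size=dim)

anomaly_map = patch_anomaly_map(patch_tokens, text_normal, text_abnormal, grid)
flagged_patch = int(anomaly_map.argmax())
```

In this sketch a single pair of image-level descriptions scores every patch, which is the setup the abstract identifies as the source of semantic misalignment. The abstract's proposal is to replace such coarse descriptions with multi-level, fine-grained prompts aligned to visual regions.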
