FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang

TL;DR

FiLo++ tackles zero-shot and few-shot anomaly detection with two components. FusDes uses large language models to generate fine-grained, task-specific anomaly descriptions, and combines fixed and learnable prompt templates with runtime prompt filtering. DefLoc employs Grounding DINO for preliminary localization and a Multi-scale Deformable Cross-modal Interaction (MDCI) module to accurately localize anomalies of diverse shapes and sizes, augmented by a position-enhanced patch matching branch for few-shot scenarios. The approach achieves state-of-the-art results on MVTec-AD and VisA in both zero-shot and few-shot settings, including a zero-shot VisA image AUC of 84.5% and pixel AUC of 96.2%. This work demonstrates that integrating language priors with vision-language localization enables rapid adaptation for anomaly detection without requiring extensive target-category data.

Abstract

Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines fixed and learnable prompt templates, and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.
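
The patch-level image-text matching that the abstract attributes to existing approaches can be illustrated with a small sketch. This is not the FiLo++ implementation: the function names (`cosine`, `patch_anomaly_scores`), the temperature of 0.07, and the toy 2-D features are assumptions chosen for illustration, and real systems would use features from a vision-language encoder such as CLIP. Each patch is scored by a softmax over its cosine similarity to one "normal" and one "abnormal" text feature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def patch_anomaly_scores(patch_feats, t_normal, t_abnormal, temperature=0.07):
    """Return, for each patch, the softmax probability of the 'abnormal' text.

    patch_feats: list of patch feature vectors (hypothetical encoder output).
    t_normal, t_abnormal: text feature vectors for the two prompt groups.
    temperature: assumed scaling factor, not taken from the paper.
    """
    scores = []
    for p in patch_feats:
        s_n = cosine(p, t_normal) / temperature
        s_a = cosine(p, t_abnormal) / temperature
        m = max(s_n, s_a)  # subtract the max for numerical stability
        e_n = math.exp(s_n - m)
        e_a = math.exp(s_a - m)
        scores.append(e_a / (e_n + e_a))
    return scores

# Toy features: patch 0 aligns with "normal", patch 1 with "abnormal",
# patch 2 is equally similar to both.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t_n = [1.0, 0.0]
t_a = [0.0, 1.0]
scores = patch_anomaly_scores(patches, t_n, t_a)
```

Because each patch is scored independently against a single pair of text features, a scheme like this has no mechanism for aggregating evidence over regions of different shapes and sizes, which is the limitation the MDCI module is described as addressing.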

Paper Structure

This paper contains 26 sections, 11 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of anomaly detection and localization between FiLo++ and previous ZSAD methods. Previous ZSAD methods use generic anomaly descriptions, which may lead to errors. Our FusDes improves detection accuracy through fine-grained anomaly descriptions, learnable templates, and runtime prompt filtering. For localization, existing ZSAD methods typically compare image patches directly with text features, resulting in false positives in background regions. Our DefLoc method effectively eliminates background areas and improves localization accuracy by employing Grounding DINO, position-enhanced text descriptions, and the MDCI module.
  • Figure 2: Overall architecture of FiLo++. Given an input image, an LLM generates fine-grained anomaly types. The normal and detailed anomaly texts are processed by Grounding DINO to obtain bounding boxes, then combined with fixed and learnable templates and encoded by the CLIP Text Encoder with runtime prompt filtering to produce $T_n$ and $T_a$. The image’s intermediate patch features interact with the text features through the MDCI module to create the vision-language anomaly map. A few-shot anomaly map is generated using the memory bank of few-shot normal samples. Finally, global image features are compared with the fused text features to obtain the global anomaly score.
  • Figure 3: Comparison of FiLo++ on MVTec and VisA datasets with different numbers of learnable vectors.
  • Figure 4: Comparison of FiLo++ on MVTec and VisA datasets with different convolution kernels.
  • Figure 5: 0-shot and 1-shot visualization results of FiLo++ on the MVTec-AD and VisA datasets. It can be observed that FiLo++ achieves anomaly segmentation results that closely approximate the ground truth even when only language or a very limited number of normal samples are provided as references.
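
The few-shot branch mentioned in the Figure 2 caption builds an anomaly map from a memory bank of normal-sample patch features. A generic version of memory-bank patch matching can be sketched as follows. This is a sketch under stated assumptions, not the paper's method: the names `cosine` and `memory_bank_scores` are hypothetical, the features are toy 2-D vectors, and the position enhancement that FiLo++ adds to this step is not modeled here. Each query patch is scored by one minus its highest cosine similarity to any stored normal patch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def memory_bank_scores(query_patches, memory_bank):
    """Score each query patch by its distance to the nearest normal patch.

    query_patches: patch features of the test image (hypothetical input).
    memory_bank: patch features collected from the few normal samples.
    A score near 0 means the patch closely matches some normal patch.
    """
    return [
        1.0 - max(cosine(q, m) for m in memory_bank)
        for q in query_patches
    ]

# Toy memory bank built from normal samples, and two query patches:
# the first matches a stored patch exactly, the second does not.
bank = [[1.0, 0.0], [0.8, 0.6]]
query = [[1.0, 0.0], [0.0, 1.0]]
fs_scores = memory_bank_scores(query, bank)
```

In this toy case the first query patch scores approximately 0 and the second approximately 0.4, since its best match in the bank has cosine similarity 0.6.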