FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Hao Li, Ming Tang, Jinqiao Wang

TL;DR

FiLo addresses zero-shot anomaly detection by replacing generic anomaly prompts with Fine-Grained Descriptions (FG-Des), which combine LLM-generated anomaly types with learnable text templates, and by improving localization (HQ-Loc) through preliminary localization with Grounding DINO, position-enhanced text prompts, and the MMCI module. The resulting text features are matched against CLIP visual features to produce a final anomaly map $M$ and a global anomaly score $S_{global}$; training uses a cross-entropy loss on the global score and a Focal-Dice loss on the local anomaly map. On VisA, FiLo reports state-of-the-art zero-shot performance, with an image-level AUC of $83.9\%$ and a pixel-level AUC of $95.9\%$, and it also improves results on MVTec. These gains are relevant to industrial quality control, where the method detects and localizes anomalies across diverse object categories without requiring normal or anomalous samples from the target categories.
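The local Focal-Dice loss mentioned above can be illustrated with a minimal sketch. It assumes the anomaly map has been flattened into per-pixel probabilities in [0, 1] with a binary ground-truth mask; the focusing parameter `gamma`, the smoothing constant `smooth`, the unweighted sum of the two terms, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over per-pixel anomaly probabilities.

    gamma=2.0 is a common default, assumed here for illustration.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        # Probability the model assigns to the true class of this pixel.
        p_t = p if t == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
        # Well-classified pixels (p_t near 1) are down-weighted.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum of sizes + smooth)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def local_loss(probs, targets):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return focal_loss(probs, targets) + dice_loss(probs, targets)


if __name__ == "__main__":
    mask = [1, 0, 1, 0]
    print(local_loss([0.9, 0.1, 0.8, 0.2], mask))  # small loss
    print(local_loss([0.1, 0.9, 0.2, 0.8], mask))  # much larger loss
```

The focal term handles the heavy imbalance between normal and anomalous pixels, while the Dice term rewards overlap between the predicted and ground-truth regions, which is why the two are often paired for segmentation-style outputs.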

Abstract

Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image features to detect anomalies and localize anomalous patches. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly detection. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and a Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets such as MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset. Code is available at https://github.com/CASIA-IVA-Lab/FiLo.
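The similarity-based scoring that the abstract attributes to existing approaches can be sketched as follows. This is a toy illustration under stated assumptions: features are plain Python lists rather than real CLIP embeddings, the `temperature` value and the function names are hypothetical, and the sketch shows only the generic per-patch baseline, not FiLo's FG-Des or MMCI components.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_prob(feature, f_normal, f_abnormal, temperature=0.07):
    """Two-way softmax over similarities to the "normal" and "abnormal" texts.

    temperature=0.07 is an assumed value chosen for illustration.
    """
    s_n = cosine(feature, f_normal) / temperature
    s_a = cosine(feature, f_abnormal) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)


def anomaly_map(patch_features, f_normal, f_abnormal):
    """Score every patch independently, as in the single-patch baseline."""
    return [anomaly_prob(p, f_normal, f_abnormal) for p in patch_features]


if __name__ == "__main__":
    f_n, f_a = [1.0, 0.0], [0.0, 1.0]          # toy text features
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(anomaly_map(patches, f_n, f_a))
```

Because each patch is scored in isolation against a single generic "abnormal" feature, the sketch makes both limitations named in the abstract visible: the text side carries no category-specific anomaly detail, and the image side has no notion of anomalies that span several patches.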

Paper Structure

This paper contains 39 sections, 10 equations, 11 figures, 14 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Comparison of anomaly detection and localization between FiLo and previous ZSAD methods. Previous ZSAD methods utilize fixed templates and generic anomaly descriptions, potentially resulting in errors. Our FG-Des enhances detection accuracy with adaptively learned text templates and fine-grained anomaly descriptions. For localization, ZSAD methods often produce false positives in background areas by directly comparing image patches with text features. Our HQ-Loc approach, using Grounding DINO, location enhancement, and MMCI, effectively removes background regions and improves localization accuracy.
  • Figure 2: Overall architecture of FiLo. Given an input image, fine-grained anomaly types are generated by an LLM. The normal and detailed abnormal texts are then input into Grounding DINO to obtain bounding boxes and fed into the CLIP Text Encoder to obtain $F_n$ and $F_a$. Intermediate patch features of the input image are passed through MMCI together with the text features to compute the anomaly map, and the global image features are compared with the text features after adaptation to obtain the global anomaly score.
  • Figure 3: Visualization results of FiLo on the MVTec and VisA datasets. "CLIP output" refers to the localization results without HQ-Loc, while "Final mask" represents the final localization result.
  • Figure 4: Illustration of similarities between images and different fine-grained anomaly descriptions.
  • Figure 5: Comparison of FiLo on MVTec and VisA datasets with different numbers of learnable vectors.
  • ...and 6 more figures
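
The captions of Figures 1 and 2 describe using Grounding DINO bounding boxes to remove background regions from the localization result. One simple way to picture that step is to zero out anomaly scores that fall outside every box. The sketch below is a hedged illustration only: the function name, the half-open pixel box format `(x0, y0, x1, y1)`, and the hard zeroing are assumptions, and the paper's actual filtering rule may differ.

```python
def mask_outside_boxes(anomaly_map, boxes):
    """Return a copy of a 2D anomaly map with scores outside all boxes set to 0.

    anomaly_map: list of rows, each a list of floats.
    boxes: iterable of (x0, y0, x1, y1) in pixel coordinates, half-open,
           i.e. covering columns x0..x1-1 and rows y0..y1-1 (assumed format).
    """
    height = len(anomaly_map)
    width = len(anomaly_map[0]) if height else 0

    # Mark every pixel covered by at least one box.
    keep = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                keep[y][x] = True

    return [
        [anomaly_map[y][x] if keep[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    scores = [[0.9, 0.8, 0.7],
              [0.6, 0.5, 0.4]]
    # One hypothetical box covering the two right-hand pixels of the top row.
    print(mask_outside_boxes(scores, [(1, 0, 3, 1)]))
```

Restricting scores to detected regions targets the failure mode noted in Figure 1, where comparing every image patch with text features produces false positives in background areas.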