Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection

Fenfang Tao, Guo-Sen Xie, Fang Zhao, Xiangbo Shu

TL;DR

The paper tackles few-shot anomaly detection by moving beyond handcrafted text prompts and simple image adapters to exploit cross-layer visual context. It introduces KAG-prompt, which combines a kernel-aware hierarchical graph (KAHG) built from multi-kernel per-layer features with a memory bank and a multi-level information fusion (MIF) module to produce pixel- and image-level anomaly scores. Pixel-level maps from text alignment ($M_p$) and from memory-bank matching ($M_v$) are fused into $M$, while the image-level score combines a CLS-alignment term $s_1$ and a top-$k$ fusion term $s_2$ to yield the final score $s$. The approach achieves state-of-the-art FSAD performance on MVTecAD and VisA, with ablations and visualizations supporting the contribution of cross-layer graph reasoning and multi-signal fusion, which suggests practical potential for automated, low-data anomaly detection in industrial settings.

Abstract

Few-shot anomaly detection (FSAD) aims to detect unseen anomaly regions with the guidance of very few normal support images from the same class. Under the prevailing large vision-language model paradigm, existing FSAD methods usually find anomalies by designing complex text prompts and aligning them with visual features. However, these methods almost always neglect intrinsic contextual information in the visual features, e.g., the interactions between different vision layers, which is an important clue for detecting anomalies comprehensively. To this end, we propose a kernel-aware graph prompt learning framework, termed KAG-prompt, which reasons over the cross-layer relations among visual features for FSAD. Specifically, a kernel-aware hierarchical graph is built by taking features from different layers, each focusing on anomalous regions of a different size, as nodes; the relationships between arbitrary pairs of nodes form the edges of the graph. By message passing over this graph, KAG-prompt captures cross-layer contextual information, leading to more accurate anomaly prediction. Moreover, to integrate multiple important anomaly signals in the prediction map, we propose a novel image-level scoring method based on multi-level information fusion. Extensive experiments on the MVTecAD and VisA datasets show that KAG-prompt achieves state-of-the-art FSAD results for both image-level and pixel-level anomaly detection. Code is available at https://github.com/CVL-hub/KAG-prompt.git.
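
The abstract describes message passing over a graph whose nodes are layer features and whose edges connect arbitrary pairs of nodes, but it does not state the edge weights or the update rule. The sketch below is only an illustration under assumed choices: each node is a single pooled feature vector, edge weights are a softmax over cosine similarities, and the update is a residual sum. The function name `message_passing` and all of these design choices are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def message_passing(nodes: np.ndarray, steps: int = 1) -> np.ndarray:
    """Propagate information over a fully connected graph of feature nodes.

    nodes: (N, D) array, one feature vector per (layer, kernel) node.
    Assumed rule (not the paper's): softmax of cosine similarity as edge
    weights, followed by a residual aggregation of neighbour features.
    """
    if nodes.ndim != 2 or nodes.shape[0] < 2:
        raise ValueError("need at least two nodes of shape (N, D)")
    x = nodes.astype(np.float64)
    for _ in range(steps):
        unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        sim = unit @ unit.T                      # (N, N) cosine similarities
        np.fill_diagonal(sim, -np.inf)           # exclude self-edges
        w = np.exp(sim - sim.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)        # each row sums to 1
        x = x + w @ x                            # residual neighbour aggregation
    return x

# Example: three nodes with 4-dimensional features.
features = np.random.default_rng(0).normal(size=(3, 4))
updated = message_passing(features)
```

After one step, every node carries a weighted mixture of the other nodes' features, which is the sense in which cross-layer context can reach each layer's representation.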

Paper Structure

This paper contains 17 sections, 17 equations, 37 figures, 28 tables.

Figures (37)

  • Figure 1: Comparison of KAG-prompt with existing FSAD models. (a) Existing FSAD methods usually design complex text prompts, i.e., $T_{i}, W_{i}$ are manually designed and/or learnable text prompts. For the query image branch, they only learn simple adapters to extract visual features for downstream tasks. However, this paradigm mis-segments normal background as anomalous. (b) Our KAG-prompt predicts the anomalies in the query image more accurately by constructing a kernel-aware hierarchical graph that captures cross-layer, multi-level relationships.
  • Figure 2: The architecture of KAG-prompt. KAG-prompt contains two modules, i.e., KAHG and MIF. The KAHG module takes visual features from different layers as input; these features interact within the kernel-aware hierarchical graph $\mathcal{G}=(\mathcal{N}, \mathcal{E})$ before being aligned with texts to obtain an anomaly localization map $M_p$. Next, the distance between each query patch feature and its most similar patch feature in the memory bank is calculated to get the localization map $M_v$. In the MIF module, for image-level score calculation, the CLS token is first adapted and aligned with the texts to get $s_{1}$; then, $M_p$ and $M_v$ are fused to get $s_{2}$ by a top-$k$ fusion mechanism; finally, $s_1$ and $s_2$ are fused to obtain the image-level score $s$.
  • Figure 3: Ablation on the top-$k$ parameter $k$ in the 1-shot setting of the VisA dataset.
  • Figure 4: Visualization of KAG-prompt on VisA under the 1-shot setting. The first row shows the query image, the second row shows the corresponding ground truth, and the third row shows the anomaly localization heatmap produced by KAG-prompt.
  • Figure 4: Ablation on the learning rate at the 1-shot setting of the VisA dataset.
  • ...and 32 more figures
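
The Figure 2 caption outlines how $M_v$, $s_2$, and $s$ are obtained, without giving the exact distance or fusion operators. The following is a minimal sketch under assumptions: cosine distance for the memory-bank lookup, an equal-weight average for fusing $M_p$ with $M_v$ and $s_1$ with $s_2$, and the mean of the $k$ largest fused values for the top-$k$ step. The function names `memory_map`, `topk_score`, and `image_score` are hypothetical.

```python
import numpy as np

def memory_map(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """M_v: distance from each query patch to its nearest memory-bank patch.

    query: (H, W, D) patch features; bank: (B, D) normal patch features.
    Assumption: cosine distance, i.e. 1 - max cosine similarity.
    """
    h, w, d = query.shape
    q = query.reshape(-1, d)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return (1.0 - (q @ b.T).max(axis=1)).reshape(h, w)

def topk_score(m_p: np.ndarray, m_v: np.ndarray, k: int) -> float:
    """s_2: mean of the k largest values of the fused map (assumed rule)."""
    fused = 0.5 * (m_p + m_v)                 # assumed equal-weight fusion
    flat = np.sort(fused.ravel())[::-1]       # descending order
    k = max(1, min(k, flat.size))
    return float(flat[:k].mean())

def image_score(s1: float, s2: float) -> float:
    """s: assumed equal-weight fusion of the two image-level terms."""
    return 0.5 * (s1 + s2)

# Toy example: a 1x2 grid of 2-D patch features and a one-entry memory bank.
query = np.array([[[1.0, 0.0], [0.0, 1.0]]])
bank = np.array([[1.0, 0.0]])
m_v = memory_map(query, bank)                 # second patch is far from the bank
m_p = np.array([[0.2, 0.6]])                  # placeholder text-alignment map
s2 = topk_score(m_p, m_v, k=1)
s = image_score(0.4, s2)                      # 0.4 stands in for s_1
```

Averaging only the top-$k$ values keeps small anomalous regions from being diluted by the many normal pixels in the map; the weighting between terms is a free design choice in this sketch.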