SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection

Zongxiang Hu; Zhaosheng Zhang

TL;DR

This work introduces a window self-attention mechanism built on the CLIP model and augmented with learnable prompts, which processes multi-level features within a Soldier-Officer Window Self-Attention (SOWA) framework and leads in 18 of 20 metrics across five benchmark datasets.

Abstract

Visual anomaly detection is essential in industrial manufacturing, yet traditional methods often rely heavily on extensive normal datasets and task-specific models, limiting their scalability. Recent advancements in large-scale vision-language models have significantly enhanced zero- and few-shot anomaly detection. However, these approaches may not fully leverage hierarchical features, potentially overlooking nuanced details crucial for accurate detection. To address this, we introduce a novel window self-attention mechanism based on the CLIP model, augmented with learnable prompts to process multi-level features within a Soldier-Officer Window Self-Attention (SOWA) framework. Evaluated on five benchmark datasets, our method leads in 18 out of 20 metrics against existing state-of-the-art techniques.

Paper Structure (22 sections, 6 equations, 13 figures, 7 tables)

Introduction
Related Work
Problem Definition
SOWA: Soldier-Officer Window Attention
Hierarchical Frozen Window Self-Attention
Dual Learnable Prompts
FWA Adapter
Visual-Language Alignment
Few-Shot Inference
Experiments
Experimental Setups
Main Results
Mechanism Analysis
Performance Analysis
Inference Speed
...and 7 more sections

Figures (13)

Figure 1: A diagram illustrating the conceptual handling of different abnormal image feature patterns (point, line, plane, motley) with the number of stars indicating the depth of the ViT. Point: Few local textures and anomalies, best detected in shallow layers, as deeper layers diffuse these local features, making fine-grained anomalies harder to detect. Line: Involves both local features and large-scale observation, best processed in middle layers. Plane: Handles large areas of features, processed in deep layers where global properties and shapes are captured. Motley: Rich in local textures and anomalies, requiring a comprehensive approach from shallow to deep layers.
Figure 2: (a) The architecture of the Hierarchical Frozen Window Self-Attention (HFWA) model. Blocks H1 to H4 represent the four feature extraction stages of CLIP ViT; (b) Detailed structure of the FWA (Frozen Window Attention) adapter; (c) Detailed structure of FWA.
Figure 3: The dimensional transformations of the input feature through FWA adapter.
Figure 4: Few-Shot inference framework.
Figure 5: Comparisons with zero-/few-shot anomaly detection methods on the MVTec-AD, VisA, BTAD, DAGM, and DTD-Synthetic datasets.
...and 8 more figures
