Attend, Distill, Detect: Attention-aware Entropy Distillation for Anomaly Detection

Sushovan Jena, Vishwas Saini, Ujjwal Shaw, Pavitra Jain, Abhay Singh Raihal, Anoushka Banerjee, Sharad Joshi, Ananth Ganesh, Arnav Bhavsar

TL;DR

The paper tackles scalable, unsupervised multi-class anomaly detection under strict latency constraints. It introduces DCAM, a training-time distributed convolutional attention module, integrated into a teacher–student knowledge-distillation framework with multi-scale feature matching, and uses cosine distance ($CD$) and KL-divergence ($KLD$) losses to mitigate cross-class interference. On the MVTec AD dataset, the best configuration (Channel DCAM with Channel CD and Spatial KLD) achieves an AUROC of 95.20% and a PRO of 89.81% while keeping inference latency comparable to strong baselines, a 3.92% AUROC improvement over STFPM. The work demonstrates a scalable, real-time approach to industrial defect detection across diverse object classes by distilling normal-class distributions into a compact student model.

Abstract

Unsupervised anomaly detection spans diverse industrial applications in which high throughput and precision are imperative. Early works centered on the one-class-one-model paradigm, which poses significant challenges in large-scale production environments. Knowledge-distillation-based multi-class anomaly detection promises low latency with reasonably good performance, but with a significant drop compared to the one-class version. We propose DCAM (Distributed Convolutional Attention Module), which improves the distillation process between teacher and student networks when there is high variance among multiple classes or objects. We integrate a multi-scale feature matching strategy that utilises a mixture of multi-level knowledge from the feature pyramids of the two networks, intuitively helping to detect anomalies of varying sizes, which is also an inherent problem in the multi-class scenario. Briefly, DCAM consists of convolutional attention blocks distributed across the feature maps of the student network, which learn to mask irrelevant information during student learning, alleviating the "cross-class interference" problem. This process is accompanied by minimizing the relative entropy, using KL divergence in the spatial dimension, and a channel-wise cosine-similarity loss between the corresponding feature maps of teacher and student. These losses help achieve scale invariance and capture non-linear relationships. We also highlight that DCAM is used only during training and not during inference, since only the learned feature maps and losses are needed for anomaly scoring; this yields a performance gain of 3.92% over the multi-class baseline while preserving latency.
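
The two feature-matching losses named in the abstract can be illustrated with a minimal, dependency-free sketch. Several choices here are assumptions rather than details taken from the paper: feature maps are flattened to `[C][N]` nested lists (with `N = H*W`), the spatial distribution of each channel is obtained with a softmax, the KL direction is teacher-to-student, and both losses are averaged. The function names are hypothetical.

```python
import math

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm + eps)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def channel_cosine_loss(teacher, student):
    """Mean cosine distance between channel vectors at each spatial position.

    teacher, student: [C][N] nested lists (assumed layout).
    """
    n_pos = len(teacher[0])
    total = 0.0
    for j in range(n_pos):
        t_vec = [ch[j] for ch in teacher]
        s_vec = [ch[j] for ch in student]
        total += cosine_distance(t_vec, s_vec)
    return total / n_pos

def spatial_kl_loss(teacher, student):
    """Mean KL divergence between per-channel spatial distributions.

    Each channel is turned into a distribution over its N positions
    with a softmax (an assumption made for this sketch).
    """
    kls = [
        kl_divergence(softmax(t_ch), softmax(s_ch))
        for t_ch, s_ch in zip(teacher, student)
    ]
    return sum(kls) / len(kls)
```

Under these assumptions, a student whose features are a uniformly rescaled copy of the teacher's incurs (numerically) zero cosine loss but a non-zero spatial KL loss, which is one way to see why the two terms carry complementary information.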

Paper Structure (19 sections, 9 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Overview of our teacher-student framework (training phase). The orange and yellow blocks represent the $2^{nd}$, $3^{rd}$, and $4^{th}$ convolutional blocks of the teacher and student networks, respectively. During training, each student feature map passes through DCAM for feature refinement, followed by channel and spatial feature matching with the corresponding teacher feature map (using cosine distance and KL divergence).
  • Figure 2: Overview of the Channel and Spatial attention module. F' is the refined feature map obtained after each attention block.
  • Figure 3: Overview of our teacher-student framework (inference phase). The orange and yellow blocks represent the $2^{nd}$, $3^{rd}$, and $4^{th}$ convolutional blocks of the teacher and student networks, respectively. During inference, the anomaly map is created by aggregating the upsampled loss maps of each block, computed as the cosine distance between the teacher and student feature maps. The progressive formation of the anomaly map for a sample test image (category: bottle) is shown alongside the ground truth.
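
The inference-time scoring described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: feature maps are `[C][H][W]` nested lists, upsampling is nearest-neighbour (chosen for simplicity), and the per-block maps are aggregated by summation (the caption only says "aggregating"). All function names are hypothetical.

```python
import math

def cosine_distance_map(teacher, student, eps=1e-12):
    """Per-pixel cosine distance between [C][H][W] feature maps; returns [H][W]."""
    n_ch, h, w = len(teacher), len(teacher[0]), len(teacher[0][0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            t = [teacher[c][y][x] for c in range(n_ch)]
            s = [student[c][y][x] for c in range(n_ch)]
            dot = sum(a * b for a, b in zip(t, s))
            norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in s))
            out[y][x] = 1.0 - dot / (norm + eps)
    return out

def upsample_nearest(m, out_h, out_w):
    """Nearest-neighbour upsampling of an [H][W] map (an assumed choice)."""
    h, w = len(m), len(m[0])
    return [[m[y * h // out_h][x * w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def anomaly_map(teacher_feats, student_feats, out_h, out_w):
    """Sum the upsampled per-block distance maps (summation is an assumption)."""
    total = [[0.0] * out_w for _ in range(out_h)]
    for t, s in zip(teacher_feats, student_feats):
        up = upsample_nearest(cosine_distance_map(t, s), out_h, out_w)
        for y in range(out_h):
            for x in range(out_w):
                total[y][x] += up[y][x]
    return total

# Toy usage: two pyramid levels; the student disagrees with the teacher
# only at the top-left position of the finer level.
t_a = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
s_a = [[[0.0, 1.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 0.0]]]
t_b = [[[1.0]], [[2.0]]]
s_b = [[[1.0]], [[2.0]]]
amap = anomaly_map([t_a, t_b], [s_a, s_b], 4, 4)
```

In the toy usage, the high scores land in the top-left quadrant of `amap`, the region covered by the single mismatched feature position; no DCAM-related computation appears, consistent with the module being training-only.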