Structural Teacher-Student Normality Learning for Multi-Class Anomaly Detection and Localization

Hanqiu Deng, Xingyu Li

TL;DR

This work identifies cross-class interference as a key failure mode of teacher-student distillation when applied to multi-class anomaly detection. It introduces Structural Teacher-Student Normality Learning (SNL), which combines structural distillation (spatial-channel alignment and affinity-based constraints) with a Central Residual Aggregation Module to learn compact normal representations. On the MVTecAD and VisA benchmarks, SNL substantially improves both anomaly detection and localization compared with baseline FD/RD methods and state-of-the-art unified models. The approach offers strong generalization to existing teacher-student networks and provides interpretable structural cues through affinity and residual-normality learning, enabling robust multi-class anomaly coverage in practical settings.

Abstract

Visual anomaly detection is a challenging open-set task aimed at identifying unknown anomalous patterns while modeling normal data. The knowledge distillation paradigm has shown remarkable performance in one-class anomaly detection by leveraging teacher-student network feature comparisons. However, extending this paradigm to multi-class anomaly detection introduces novel scalability challenges. In this study, we address the significant performance degradation observed in previous teacher-student models when applied to multi-class anomaly detection, which we identify as resulting from cross-class interference. To tackle this issue, we introduce a novel approach known as Structural Teacher-Student Normality Learning (SNL): (1) We propose spatial-channel distillation and intra- and inter-affinity distillation techniques to measure structural distance between the teacher and student networks. (2) We introduce a central residual aggregation module (CRAM) to encapsulate the normal representation space of the student network. We evaluate our proposed approach on two anomaly detection datasets, MVTecAD and VisA. Our method surpasses the state-of-the-art distillation-based algorithms by a significant margin of 3.9% and 1.5% on MVTecAD and 1.2% and 2.5% on VisA in the multi-class anomaly detection and localization tasks, respectively. Furthermore, our algorithm outperforms the current state-of-the-art unified models on both MVTecAD and VisA.
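
The teacher-student feature comparison that the abstract builds on can be illustrated with a small sketch. It assumes a per-location cosine distance between teacher and student feature maps, which is a common choice in distillation-based detectors; the SNL losses themselves are not specified in this section, so this is not the paper's formulation. The function names, the feature shape `(C, H, W)`, and the max-pooling image score are all illustrative assumptions.

```python
import numpy as np


def cosine_anomaly_map(teacher_feat, student_feat, eps=1e-8):
    """Per-location cosine distance between two (C, H, W) feature maps.

    Assumption: cosine distance over the channel axis. This is a generic
    teacher-student comparison, not the SNL objective.
    """
    t = np.asarray(teacher_feat, dtype=np.float64)
    s = np.asarray(student_feat, dtype=np.float64)
    if t.ndim != 3 or t.shape != s.shape:
        raise ValueError("expected two feature maps of identical shape (C, H, W)")
    dot = (t * s).sum(axis=0)                                      # (H, W)
    norms = np.linalg.norm(t, axis=0) * np.linalg.norm(s, axis=0)  # (H, W)
    return 1.0 - dot / (norms + eps)


def image_score(anomaly_map):
    """Image-level score as the maximum of the map (one simple convention)."""
    return float(np.max(anomaly_map))


# Toy data standing in for real backbone features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 8, 8))
student = teacher.copy()
# Simulate a region where the student fails to reproduce the teacher.
student[:, 2:4, 5:7] = rng.normal(size=(16, 2, 2))

amap = cosine_anomaly_map(teacher, student)
score = image_score(amap)
```

In this toy setup the map is close to zero wherever the student matches the teacher and large only inside the perturbed 2x2 region, which is the behaviour a one-class distillation detector relies on. Cross-class interference, as described in the abstract, corresponds to the student matching the teacher too well on patterns that are abnormal for the class at hand.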

Paper Structure (20 sections, 14 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: We visualize the performance degradation of one-class teacher-student networks, RD (left) and FD (right), in the multi-class anomaly detection task on MVTecAD. Applying our structural normality learning (SNL) strategy to the teacher-student model significantly improves multi-class anomaly detection and localization for both methods. SNL also boosts performance in the one-class case.
  • Figure 2: (a) Demonstration of cross-class interference in multi-class anomaly detection. (b) Empirical analysis of cross-class interference. We generate mixtures of the “hazelnut” and “screw” categories of MVTecAD via mixup and use them as anomaly samples. FD and RD show no discrepancy in the anomaly scores, whereas our models exhibit significant differences. (c) Qualitative analysis of cross-class interference. We crop a small region from an image in the “cable” category and paste it onto an image in the “wood” category to form an anomaly sample on MVTecAD. Both FD and RD fail to identify the synthetic anomalous regions, whereas our models locate the anomalies precisely.
  • Figure 3: Overview of our structural teacher-student framework. Left: during training with normal samples, our structural distillation quantifies and minimizes the difference between channel-wise features, spatial-wise features, intra-affinity and inter-affinity metrics for the $k$th block of the teacher-student network. Right: during testing, for query samples, we measure the local and structural differences by the channel-wise feature distance and the intra-affinity distance of the teacher-student network, respectively, for anomaly detection.
  • Figure 4: Overview of the proposed CRAM. We add CRAM after each CNN block of the baseline student networks (FD and RD).
  • Figure 5: Visualization of our approach and the baseline RD in various anomaly scenarios of MVTecAD and VisA.
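
The two cross-class probes described in the Figure 2 caption, mixup between two categories and cropping a region from one category to paste onto another, can be sketched as follows. This is a minimal illustration under stated assumptions: images are float arrays of shape `(H, W, C)`, and the mixing weight `lam`, the crop coordinates, and the helper names `mixup` and `cut_paste` are hypothetical choices, since the caption does not give the exact settings used.

```python
import numpy as np


def mixup(img_a, img_b, lam=0.5):
    """Convex combination of two same-sized images (standard mixup)."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * a + (1.0 - lam) * b


def cut_paste(src, dst, top, left, height, width):
    """Copy a rectangle from src onto the same location in dst.

    Returns the pasted image and a binary mask of the synthetic region,
    which can serve as localization ground truth.
    """
    src = np.asarray(src, dtype=np.float64)
    out = np.array(dst, dtype=np.float64, copy=True)
    if src.shape != out.shape:
        raise ValueError("images must have the same shape")
    h, w = out.shape[:2]
    if top < 0 or left < 0 or top + height > h or left + width > w:
        raise ValueError("crop rectangle falls outside the image")
    rows = slice(top, top + height)
    cols = slice(left, left + width)
    out[rows, cols] = src[rows, cols]
    mask = np.zeros((h, w), dtype=bool)
    mask[rows, cols] = True
    return out, mask


# Constant-valued stand-ins for two categories (not real MVTecAD images).
class_a = np.zeros((32, 32, 3))
class_b = np.ones((32, 32, 3))

mixed = mixup(class_a, class_b, lam=0.5)
pasted, gt_mask = cut_paste(class_b, class_a, top=4, left=8, height=6, width=10)
```

Both outputs are built entirely from normal pixels of other classes, so a detector that scores them as normal is exhibiting the cross-class interference the caption describes. The returned mask marks where a localization method should respond.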