Dual-Student Knowledge Distillation Networks for Unsupervised Anomaly Detection
Liyi Yao, Shaobing Gao
TL;DR
This work tackles unsupervised industrial anomaly detection under data imbalance and diverse defect types by proposing Dual-Student Knowledge Distillation (DSKD), where a fixed teacher guides two inverted students (Se and Sd) to strengthen normal-pattern consistency while boosting anomaly representation. The method uses multi-scale distillation of intermediate feature maps and a deep feature (DF) embedding bottleneck to fuse semantic information and promote diverse anomaly cues, with anomaly inference driven by pixel-level discrepancies across scales. Experiments on MVTec AD, MVTec 3D-AD, and MT Defects demonstrate strong image- and pixel-level detection and localization performance with low computational complexity, backed by ablations that validate the contributions of the dual-student architecture, the DF embedding, and multi-scale fusion. Overall, DSKD advances unsupervised anomaly detection by balancing robust normal-data alignment with enhanced sensitivity to anomalous patterns, offering practical impact for efficient industrial inspection and potential extension to 3D data.
Abstract
Due to data imbalance and the diversity of defects, student-teacher (S-T) networks are favored in unsupervised anomaly detection; they recognize anomalies by exploiting the discrepancy in feature representations that arises from the knowledge distillation process. However, the vanilla S-T network is not stable. Employing identical structures for the student and the teacher may weaken the representational discrepancy on anomalies, whereas using different structures can increase the likelihood of divergent performance on normal data. To address this problem, we propose a novel dual-student knowledge distillation (DSKD) architecture. Unlike other S-T networks, we use two student networks and a single pre-trained teacher network, where the students have the same scale but inverted structures. This framework can enhance the distillation effect to improve consistency in the recognition of normal data, while simultaneously introducing diversity in anomaly representation. To exploit high-dimensional semantic information for capturing anomaly clues, we employ two strategies. First, a pyramid matching mode is used to perform knowledge distillation on multi-scale feature maps in the intermediate layers of the networks. Second, interaction between the two student networks is facilitated through a deep feature embedding module, which is inspired by real-world group discussions. For classification, we obtain pixel-wise anomaly segmentation maps by measuring the discrepancy between the output feature maps of the teacher and student networks, from which an anomaly score is computed for sample-wise determination. We evaluate DSKD on three benchmark datasets and probe the effects of its internal modules through ablation experiments. The results demonstrate that DSKD can achieve exceptional performance on small models such as ResNet18 and effectively improve on vanilla S-T networks.
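The inference step described in the abstract (pixel-wise maps from teacher-student feature discrepancy, then a sample-wise score) can be illustrated with a minimal NumPy sketch. Several choices here are assumptions for illustration and are not stated in the abstract: cosine distance as the discrepancy measure, nearest-neighbour upsampling, summation over scales and over both students, and the map maximum as the image-level score. All function names are hypothetical.

```python
import numpy as np

def cosine_distance_map(teacher_feat, student_feat, eps=1e-8):
    """Per-pixel (1 - cosine similarity) along the channel axis.

    Both inputs have shape (C, H, W); the result has shape (H, W).
    Cosine distance is an assumed discrepancy measure.
    """
    dot = (teacher_feat * student_feat).sum(axis=0)
    norm = np.linalg.norm(teacher_feat, axis=0) * np.linalg.norm(student_feat, axis=0)
    return 1.0 - dot / (norm + eps)

def upsample_nearest(score_map, out_size):
    """Nearest-neighbour upsampling of an (H, W) map to (out_size, out_size).

    Assumes out_size is an integer multiple of both H and W.
    """
    h, w = score_map.shape
    return np.kron(score_map, np.ones((out_size // h, out_size // w)))

def anomaly_map(teacher_feats, student_feats_list, out_size):
    """Sum the upsampled discrepancy maps over all scales and all students.

    teacher_feats: list of (C_k, H_k, W_k) arrays, one per scale.
    student_feats_list: one such list per student (two students in DSKD).
    Summation as the fusion rule is an assumption.
    """
    total = np.zeros((out_size, out_size))
    for student_feats in student_feats_list:
        for t, s in zip(teacher_feats, student_feats):
            total += upsample_nearest(cosine_distance_map(t, s), out_size)
    return total

def anomaly_score(amap):
    """Image-level score; taking the maximum of the map is an assumed choice."""
    return float(amap.max())
```

With features that match the teacher everywhere, the map stays near zero; a student whose features deviate at one location produces a peak there, which is the behaviour the discrepancy-based segmentation relies on.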
