Interpret the Predictions of Deep Networks via Re-Label Distillation
Yingying Hua, Shiming Ge, Daichi Zhang
TL;DR
This work addresses the local interpretability of deep network predictions. It trains a simple linear student model, in a self-supervised manner, on synthetic images generated by perturbing the input in the latent space of a VAE. The synthetic samples are re-labeled according to whether the teacher's prediction shifts, and both soft and hard information from the teacher is distilled into the student, yielding a transparent model that highlights the salient features contributing to a given prediction. On ImageNet with standard architectures, the method produces more precise saliency explanations than several baselines, as shown by qualitative saliency maps and quantitative deletion/insertion metrics. Overall, re-label distillation offers a practical route to explainability through an interpretable, boundary-aware approximation of the DNN's local decision process.
Abstract
Interpreting the predictions of a black-box deep network can make its deployment more reliable. In this work, we propose a re-label distillation approach that learns a direct map from the input to the prediction in a self-supervised manner. The image is projected into a VAE subspace, and synthetic images are generated by randomly perturbing its latent vector. Each synthetic image is then annotated with one of two classes according to whether its predicted label shifts. With the deep network acting as the teacher that provides these annotations, a linear student model is trained to approximate them by mapping the synthetic images to the two classes. In this manner, the re-labeled synthetic images describe the local classification mechanism of the deep network well, and the learned student provides a more intuitive explanation of the predictions. Extensive experiments verify the effectiveness of our approach both qualitatively and quantitatively.
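
The pipeline described in the abstract (perturb the latent vector, re-label by label shift, fit a linear student) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `relabel_distill`, `toy_decode`, and `toy_teacher` are hypothetical names, the toy decoder and teacher are stand-ins for a trained VAE decoder and a deep network, and the student is fit on hard re-labels only with plain gradient descent on a logistic loss.

```python
import numpy as np


def relabel_distill(z0, decode, teacher_predict, n_samples=500, sigma=0.5,
                    lr=0.1, epochs=200, seed=0):
    """Fit a linear student on re-labeled synthetic samples around z0.

    decode: maps a latent vector to a flat image vector (assumed VAE decoder).
    teacher_predict: maps a flat image vector to a class index (assumed DNN).
    """
    rng = np.random.default_rng(seed)
    y0 = teacher_predict(decode(z0))  # teacher's label for the original input

    # Generate synthetic images by randomly perturbing the latent vector.
    Z = z0 + sigma * rng.standard_normal((n_samples, z0.size))
    X = np.stack([decode(z) for z in Z])

    # Re-label: 1 if the teacher's label shifts, 0 if it stays the same.
    y = np.array([int(teacher_predict(x) != y0) for x in X])

    # Linear student: logistic regression trained by gradient descent.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * (X.T @ err) / n_samples
        b -= lr * err.mean()
    return w, b, X, y


# Hypothetical toy stand-ins: a fixed linear "decoder" from a 4-d latent
# space to a flattened 4x4 image, and a "teacher" with a simple rule.
_A = np.random.default_rng(1).standard_normal((16, 4))


def toy_decode(z):
    return _A @ z


def toy_teacher(x):
    return int(x[:4].sum() > 0)


z0 = np.zeros(4)  # latent code of the input being explained (toy choice)
w, b, X, y = relabel_distill(z0, toy_decode, toy_teacher)

# Reading the weight magnitudes as a per-pixel saliency map is an
# illustrative choice made here, not a detail taken from the abstract.
saliency = np.abs(w).reshape(4, 4)
```

In practice `decode` would wrap a trained VAE decoder and `teacher_predict` the network under inspection. The TL;DR also mentions distilling soft information from the teacher, which this hard-label sketch leaves out.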
