De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts

Yuzheng Wang; Dingkang Yang; Zhaoyu Chen; Yang Liu; Siao Liu; Wenqiang Zhang; Lihua Zhang; Lizhe Qi

TL;DR

This work addresses distribution shifts in Data-Free Knowledge Distillation (DFKD), treating them as harmful confounders that bias learning from substitution data. It introduces Knowledge Distillation Causal Intervention (KDCI), a framework based on backdoor adjustment and built on a customized causal graph, which separates the knowledge transferred from teacher to student from shift-induced bias. KDCI operates in two stages: (1) constructing a confounder dictionary from substitution data via prototype clustering, and (2) compensating biased student predictions by fusing them with prior information $F(z)$, which is computed with attention-based weighting. Empirically, KDCI provides consistent gains when applied to six representative DFKD baselines across CIFAR-10/100, Tiny-ImageNet, and ImageNet, including improvements of up to 15.54% accuracy on CIFAR-100, while maintaining modest computational overhead. This plug-and-play approach offers a causal, interpretable route to robust data-free distillation under distribution shifts, with practical implications for privacy-preserving model deployment.
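The two stages summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: plain k-means stands in for the prototype clustering, scaled dot-product attention stands in for the attention-based weighting, and additive fusion stands in for the compensation step. All function names, the classifier matrix `W`, and the feature shapes are hypothetical.

```python
import numpy as np

def build_confounder_dictionary(features, n_prototypes, n_iters=20, seed=0):
    """Stage 1 (assumed form): cluster features into prototypes with plain k-means.

    Returns the prototypes Z of shape (N, dim) and their empirical priors P(z_i).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=n_prototypes, replace=False)
    prototypes = features[idx].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every feature to every prototype.
        dist = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(n_prototypes):
            members = features[assign == k]
            if len(members) > 0:  # keep the old prototype if a cluster is empty
                prototypes[k] = members.mean(axis=0)
    # Final assignment gives the fraction of samples per prototype.
    dist = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(dist.argmin(axis=1), minlength=n_prototypes)
    return prototypes, counts / counts.sum()

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_information(student_feat, prototypes, priors):
    """Stage 2 (assumed form): F(z) = sum_i lambda_i * z_i * P(z_i)."""
    scores = student_feat @ prototypes.T / np.sqrt(prototypes.shape[1])
    lam = softmax(scores, axis=1)                # attention weights, (batch, N)
    return (lam * priors[None, :]) @ prototypes  # weighted prototypes, (batch, dim)

def compensated_logits(student_feat, prototypes, priors, W):
    """Fuse the student feature with F(z) by addition (an assumption), then classify."""
    fused = student_feat + prior_information(student_feat, prototypes, priors)
    return fused @ W
```

In this sketch, a distillation loss would be computed between the teacher's prediction and the output of `compensated_logits`, rather than the raw student prediction.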

Abstract

Data-Free Knowledge Distillation (DFKD) is a promising task that trains high-performance small models for practical deployment without relying on the original training data. Existing methods commonly avoid private data by using synthetic or sampled data instead. However, a long-overlooked issue is the severe distribution shift between this substitution data and the original data, which manifests as large differences in image quality and class proportions. These harmful shifts act as a confounder that causes significant performance bottlenecks. To tackle the issue, this paper proposes a novel perspective based on causal inference to disentangle the student models from the impact of such shifts. By designing a customized causal graph, we first reveal the causal relationships among the variables in the DFKD task. Subsequently, we propose a Knowledge Distillation Causal Intervention (KDCI) framework based on the backdoor adjustment to de-confound the confounder. KDCI can be flexibly combined with most existing state-of-the-art baselines. Experiments in combination with six representative DFKD methods demonstrate the effectiveness of KDCI, which clearly helps existing methods under almost all settings, e.g., improving the baseline by up to 15.54% accuracy on the CIFAR-100 dataset.
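For reference, the backdoor adjustment named above has the following standard form in causal inference. Here $X$ stands for the substitution-data input, $Y$ for the student prediction, and $Z$ for the confounder with strata $z$; mapping these symbols onto the DFKD variables is an assumption made for illustration and may differ from the paper's own notation.

```latex
P\bigl(Y \mid do(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P(Z = z)
```

Conditioning on each stratum $z$ and weighting by its prior $P(Z = z)$, rather than by $P(Z = z \mid X)$, is what removes the confounder's influence on the dependence of $Y$ on $X$.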

Paper Structure (26 sections, 5 equations, 8 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Methodology
Causal Graph of DFKD Task
Causal Intervention via Backdoor Adjustment
De-confounded DFKD with KDCI
Experiments
Datasets and Models
Method Zoo
Confounder Setup
Performance Comparison
Analysis of Prior Information $F(\bm{z})$
Analysis of Confounder Dictionary $\bm{Z}$
Qualitative Results
Conclusion
...and 11 more sections

Figures (8)

Figure 1: Distribution shifts between the original and substitute data for existing DFKD methods on CIFAR-10. (a) shows randomly selected samples and FID scores for data synthesized by DAFL and DeepInv and data sampled by DFND. (b) shows the proportion of samples per class (%) in the original and substitute data.
Figure 2: The causal graph. (a) The existing methods ignore distribution shifts. (b) The shifts are alleviated by causal inference.
Figure 3: Overview of KDCI. In stage (a), all substitution data is fed into a pre-trained model to extract prior knowledge and construct the confounder dictionary. In stage (b), the prototype integration is built from the confounder dictionary and is used to compensate for biased student predictions. The distillation loss is calculated between the teacher's prediction and the student's compensated prediction.
Figure 4: Test accuracy (%) on the CIFAR-10 and CIFAR-100 datasets for different confounder dictionary sizes $N$. The teacher uses ResNet-34 and the student uses ResNet-18 as the backbone.
Figure 5: t-SNE visualizations of vanilla and KDCI-based models on the Tiny-ImageNet dataset. KDCI helps models obtain clearer clusters, which indicates its positive impact.
...and 3 more figures
