Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Jiali Cheng, Chirag Agarwal, Hadi Amiri
TL;DR
This work investigates how knowledge distillation (KD) affects the transfer of debiasing capabilities across language and vision tasks, using diverse backbones and debiasing methods. By formalizing distillability and evaluating Teacher–Student, KD, and Non-KD scenarios, the authors reveal that debiasing transfer is neither guaranteed nor uniform; larger teachers do not always yield more robust students, and KD can even amplify reliance on spurious correlations. They uncover internal mechanisms via activation-pattern analysis and circuit discovery, showing attention divergence and circuit shifts as key factors in post-KD behavior. To improve distillability, the paper proposes data augmentation, iterative knowledge distillation, and initializing the student with teacher weights, demonstrating substantial gains. Overall, the findings offer practical guidance for designing robust debiasing strategies in KD pipelines and highlight the nuanced interplay between model scale, data, and training objectives.
Abstract
Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall, the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention patterns and circuits that cause the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study at scale of the effect of KD on debiasing and its internal mechanism. Our findings offer insight into how KD works and how to design better debiasing methods.
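For readers unfamiliar with KD, the following is a minimal sketch of a commonly used distillation objective: a temperature-softened KL term between teacher and student predictions, mixed with the usual cross-entropy on the gold label. The abstract does not state the exact objective used in the study, so this standard formulation, the function names `softmax` and `kd_loss`, and the default values of `temperature` and `alpha` are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical KD objective for one example (an assumed, standard form).

    alpha weights the soft-target KL term; (1 - alpha) weights the
    cross-entropy between the student prediction and the gold label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

    # Cross-entropy of the unsoftened student prediction against the label.
    ce = -math.log(softmax(student_logits)[label])

    # The temperature**2 factor is the conventional rescaling that keeps the
    # soft-target gradients comparable in magnitude across temperatures.
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * ce


if __name__ == "__main__":
    teacher = [3.0, 0.0, 0.0]
    print(kd_loss([3.0, 0.0, 0.0], teacher, label=0))  # student agrees with teacher
    print(kd_loss([0.0, 3.0, 0.0], teacher, label=0))  # student disagrees
```

Under such an objective the student is rewarded for matching the teacher's output distribution on the training data, which by itself does not constrain which features the student relies on to do so.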
