Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods

Jiali Cheng, Chirag Agarwal, Hadi Amiri

TL;DR

This work investigates how knowledge distillation (KD) affects the transfer of debiasing capabilities across language and vision tasks, using diverse backbones and debiasing methods. By formalizing distillability and evaluating Teacher–Student, KD, and Non-KD scenarios, the authors reveal that debiasing transfer is neither guaranteed nor uniform; larger teachers do not always yield more robust students, and KD can even amplify reliance on spurious correlations. They uncover internal mechanisms via activation-pattern analysis and circuit discovery, showing attention divergence and circuit shifts as key factors in post-KD behavior. To improve distillability, the paper proposes data augmentation, iterative knowledge distillation, and initializing the student with teacher weights, demonstrating substantial gains. Overall, the findings offer practical guidance for designing robust debiasing strategies in KD pipelines and highlight the nuanced interplay between model scale, data, and training objectives.

Abstract

Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on a model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall, the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention pattern and circuit that cause the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study at scale of the effect of KD on debiasing and its internal mechanism. Our findings provide insight into how KD works and how to design better debiasing methods.
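For readers unfamiliar with KD, the following is a minimal sketch of the widely used soft-label distillation objective: a temperature-scaled KL term between teacher and student distributions, mixed with the usual cross-entropy on the gold label. It is an illustration only. The abstract does not state which KD objective or hyperparameters the paper uses, so the function names `softmax` and `kd_loss`, and the `temperature` and `alpha` defaults, are assumptions made here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical KD objective for one example (not the paper's exact setup).

    alpha weights the distillation term; (1 - alpha) weights the
    cross-entropy on the gold label.
    """
    eps = 1e-12  # guards against log(0)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(
        pt * (math.log(pt) - math.log(max(ps, eps)))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

    # Standard cross-entropy of the student on the gold label (temperature 1).
    ce = -math.log(max(softmax(student_logits)[label], eps))

    # The temperature**2 factor keeps the scale of the KL term comparable
    # across temperatures.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce
```

With `alpha=1.0` the student is trained only to match the teacher, so any reliance on spurious features encoded in the teacher's soft labels is part of the training signal; with `alpha=0.0` the objective reduces to ordinary supervised training.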

Paper Structure

This paper contains 45 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Framework for the analysis of distillability of debiasing methods. (a) Training from scratch: we train a debiasing method $M_i$ from scratch without knowledge distillation on different scales (teacher $\mathcal{T}$ and student $\mathcal{S}$ such that $\mathcal{T} > \mathcal{S}$) to obtain models $f_{\mathcal{T}}$ and $f_{\mathcal{S}}$, respectively. (b) Training with knowledge distillation: we apply knowledge distillation to transfer knowledge from teacher ($f_{\mathcal{T}}$) to student ($g_{\mathcal{T} \rightarrow \mathcal{S}}$). (c) Assessment: C1 determines if knowledge distillation can transfer the debiasing capability from teacher ($f_{\mathcal{T}}$) to student ($g_{\mathcal{T} \rightarrow \mathcal{S}}$), C2 determines the contribution of knowledge distillation in training a debiased model, and C3 compares different debiasing methods and backbones under knowledge distillation.
  • Figure 2: Average performance gaps on ID, OOD, and Spurious Gap between (a) Teacher and Student and (b) Non-KD and KD. The X-axis and Y-axis show the scale of the student ($\mathcal{S}$) and teacher ($\mathcal{T}$), respectively. Each cell shows the performance gap. See the additional-results appendix for detailed results.
  • Figure 3: Prediction agreement on text datasets. Agreement increases as the scales of teacher and student get closer. See the additional-results appendix for detailed results.
  • Figure 4: Density of predicted probability. On OOD, students have a larger deviation in prediction confidence than teachers. See the additional-results appendix for detailed results.
  • Figure 5: C1: Teacher vs. Student (Left) and C2: Non-KD vs. KD (Right): correctly predicted examples on OOD on text datasets.
  • ...and 11 more figures
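
The captions above do not define the prediction agreement reported in Figure 3. A common reading, assumed here, is the fraction of examples on which teacher and student output the same label. The sketch below implements that reading; the function name `prediction_agreement` is hypothetical and the paper's exact definition may differ.

```python
def prediction_agreement(teacher_preds, student_preds):
    """Fraction of examples where both models predict the same label.

    Assumed definition; the source captions do not spell out the metric.
    """
    if len(teacher_preds) != len(student_preds):
        raise ValueError("prediction lists must have the same length")
    if not teacher_preds:
        raise ValueError("prediction lists must be non-empty")
    matches = sum(t == s for t, s in zip(teacher_preds, student_preds))
    return matches / len(teacher_preds)
```

Note that this measure is independent of the gold labels: two models can agree on every example while both being wrong, so it complements the accuracy-based gaps in Figure 2 and does not replace them.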