Adaptive Group Robust Ensemble Knowledge Distillation

Patrik Kenfack, Ulrich Aïvodji, Samira Ebrahimi Kahou

TL;DR

This paper shows that ensemble knowledge distillation can degrade worst-group performance in the student, even when the teachers are debiased. It introduces Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), which uses an additional biased model as a reference and upweights teachers whose gradient directions deviate from it, thereby reducing the student's reliance on spurious correlations. Across synthetic and real-world benchmarks, AGRE-KD consistently improves worst-group accuracy, often outperforming standard deep ensembles and other KD baselines, and it remains robust across varying numbers of debiased teachers and across heterogeneous architectures. The work offers a practical, gradient-based strategy for improving the accuracy of unknown underrepresented subgroups in distillation, with potential impact on deploying compact yet robust models on edge devices.

Abstract

Neural networks can learn spurious correlations in the data, often leading to performance degradation for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex teacher model to a relatively "simple" student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear whether this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly degrade the performance of the worst-case subgroups in the distilled student model, even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy that ensures the student model receives knowledge beneficial for unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively favors teachers whose knowledge would better improve the worst-performing subgroups by upweighting the teachers whose gradient directions deviate from those of the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting. Our source code is available at https://github.com/patrikken/AGRE-KD
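
The weighting idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`teacher_weights`, `aggregate_targets`) are hypothetical, gradients are taken with respect to the student's logits rather than its parameters, and the weight is assumed to be one minus the cosine similarity between a teacher's distillation gradient and the biased model's, clipped at zero and normalized. The exact rule used in the paper may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_logit_gradient(student_probs, teacher_probs):
    """Gradient of the soft cross-entropy KD loss w.r.t. the student logits
    (up to a temperature factor): student probabilities minus teacher ones."""
    return [s - t for s, t in zip(student_probs, teacher_probs)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def teacher_weights(student_logits, teacher_logits_list, biased_logits,
                    temperature=1.0):
    """ASSUMED rule: weight_t is proportional to 1 - cos(g_t, g_biased),
    so teachers pulling the student the same way as the biased model get
    little weight and teachers pulling away from it get more."""
    p_student = softmax(student_logits, temperature)
    g_biased = kd_logit_gradient(p_student, softmax(biased_logits, temperature))
    raw = []
    for logits in teacher_logits_list:
        g_teacher = kd_logit_gradient(p_student, softmax(logits, temperature))
        # Clip at zero to guard against tiny negative values from rounding.
        raw.append(max(0.0, 1.0 - cosine(g_teacher, g_biased)))
    total = sum(raw)
    if total == 0.0:
        # All teachers agree with the biased model: fall back to uniform.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def aggregate_targets(teacher_logits_list, weights, temperature=1.0):
    """Weighted average of the teachers' soft labels for one sample."""
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(num_classes)]
```

For a two-class sample where one teacher agrees with the biased model and another disagrees, this sketch shifts nearly all the weight to the disagreeing teacher, so the student's soft target follows that teacher. A uniform average would instead cancel the two teachers out.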

Paper Structure

This paper contains 31 sections, 7 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our adaptive weighting method based on gradient direction. Bolder lines indicate teachers that receive a higher weight in the aggregated output.
  • Figure 2: Overview of AGRE-KD.
  • Figure 3: Effect of the proportion of debiased teachers in the ensemble. We trained the student model using an ensemble of 5 teachers with different ratios of debiased teachers ($\{0.2, 0.4, 0.6, 0.8, 1\}$). AGRE-KD effectively upweights and favors the least biased teachers in the ensemble, while other ensemble methods rely more on the biased teachers' outputs, which lowers the worst-group accuracy (WGA). AGRE-KD maintains significantly higher WGA even with only a single debiased model in the ensemble.
  • Figure 4: Effect of the parameter $\alpha$ (Eq. \ref{eq:ens_KD}) on the worst-group accuracy of DFR teachers.
  • Figure 5: Effect of the temperature parameter on the worst-group accuracy.
  • ...and 2 more figures