Mitigating Bias in Dataset Distillation
Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh
TL;DR
This paper investigates how biases in the original dataset propagate through dataset distillation (DD) and how they can be mitigated. It shows that color and background biases are amplified in distilled datasets, while corruption bias is suppressed, which motivates bias-aware DD. The authors introduce a sample reweighting scheme based on kernel density estimation (KDE) and guided by a supervised contrastive embedding, which down-weights bias-aligned samples during distillation. The gains are substantial: on CMNIST with a 5% bias-conflict ratio and IPC 50, test accuracy rises from 23.8% to 91.5%. The method is evaluated across multiple DD paradigms and datasets, where it improves strongly over vanilla DD and matches or exceeds some debiasing baselines, with practical runtime overhead. Together, these results establish bias amplification as a critical issue in DD and provide a practical mitigation strategy that extends to several DD methods.
Abstract
Dataset distillation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study how bias in the original dataset affects the performance of dataset distillation. Through a comprehensive empirical evaluation on canonical datasets with color, corruption, and background biases, we find that color and background biases in the original dataset are amplified by the distillation process, resulting in a notable decline in the performance of models trained on the distilled dataset, whereas corruption bias is suppressed. To reduce bias amplification in dataset distillation, we introduce a simple yet highly effective approach based on a sample reweighting scheme that uses kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with a 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% for vanilla DM, an improvement of 67.7 percentage points, whereas applying a state-of-the-art debiasing method to the same dataset achieves only 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset distillation and provide a promising avenue for mitigating bias amplification in the process.
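To make the reweighting idea concrete, the following is a minimal sketch of KDE-based sample weighting, not the authors' implementation. It assumes that each sample already has an embedding (for example, from a supervised contrastive encoder), that a Gaussian kernel is used, and that weights are taken inversely proportional to the estimated density. The function name `kde_sample_weights` and the parameters `bandwidth` and `temperature` are hypothetical and chosen only for illustration.

```python
import numpy as np


def kde_sample_weights(embeddings, bandwidth=1.0, temperature=1.0):
    """Return normalized weights that shrink as local density grows.

    Assumption (not from the paper): Gaussian kernel, weight = density ** (-temperature).
    """
    x = np.asarray(embeddings, dtype=np.float64)

    # Pairwise squared Euclidean distances between all embeddings.
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values

    # Unnormalized Gaussian kernel density estimate at each sample.
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    density = kernel.mean(axis=1)

    # Samples in dense regions (the bias-aligned majority) get small weights;
    # samples in sparse regions (bias-conflicting) get large weights.
    weights = density ** (-temperature)
    return weights / weights.sum()


# Toy data: 95 tightly clustered "bias-aligned" points, 5 distant "bias-conflicting" points.
rng = np.random.default_rng(0)
aligned = rng.normal(0.0, 0.1, size=(95, 2))
conflicting = rng.normal(5.0, 0.1, size=(5, 2))
weights = kde_sample_weights(np.vstack([aligned, conflicting]), bandwidth=0.5)
```

In this toy setup, each of the five sparse points receives a larger weight than any point in the dense cluster. In a distillation objective such weights could, for instance, replace a uniform average over real samples with a weighted one; how the weights enter each DD method is specified in the paper body rather than in this abstract.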
