Mitigating Bias in Dataset Distillation

Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh

TL;DR

This paper investigates how biases in the original dataset propagate through dataset distillation (DD) and how they can be mitigated. It shows that color and background biases are amplified in distilled datasets, while corruption bias is suppressed, which motivates bias-aware DD. The authors introduce a sample reweighting scheme based on kernel density estimation (KDE), computed on embeddings from a supervised contrastive model, that down-weights bias-aligned samples during distillation. On CMNIST with a 5% bias-conflict ratio and 50 images per class (IPC 50), test accuracy rises from 23.8% with vanilla DM to 91.5%. The method is evaluated across multiple DD paradigms and datasets, where it improves substantially over vanilla DD and matches or exceeds some debiasing baselines at a practical runtime overhead. Together, these results identify bias amplification as a critical issue in DD and offer a mitigation strategy that extends to several DD methods.

Abstract

Dataset distillation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias in the original dataset on the performance of dataset distillation. Through a comprehensive empirical evaluation on canonical datasets with color, corruption, and background biases, we find that color and background biases in the original dataset are amplified by the distillation process, resulting in a notable decline in the performance of models trained on the distilled dataset, while corruption bias is suppressed. To reduce bias amplification in dataset distillation, we introduce a simple yet highly effective approach based on a sample reweighting scheme that uses kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with a 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% from vanilla DM, an improvement of 67.7 percentage points, whereas applying a state-of-the-art debiasing method to the same dataset achieves only 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset distillation and provide a promising avenue for addressing bias amplification in the process.
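
As a rough illustration of the reweighting idea described in the abstract, the sketch below estimates a Gaussian kernel density for every sample and assigns weights inversely proportional to it, so that samples in dense (bias-aligned) regions count less. It is a toy under stated assumptions, not the paper's implementation: the 2-D points stand in for embeddings from a supervised contrastive model, and the function names, the `bandwidth` parameter, and the inverse-density normalization are hypothetical choices made for this example.

```python
import math


def gaussian_kde_scores(embeddings, bandwidth=1.0):
    """Return an unnormalized Gaussian KDE score for each embedding.

    The score of a point is the mean kernel value between it and every
    point in the set (itself included), so it is always at least 1/n.
    """
    n = len(embeddings)
    scores = []
    for x in embeddings:
        total = 0.0
        for y in embeddings:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        scores.append(total / n)
    return scores


def inverse_density_weights(scores):
    """Turn density scores into weights that sum to 1 (assumed normalization)."""
    inverse = [1.0 / s for s in scores]
    norm = sum(inverse)
    return [v / norm for v in inverse]


# Toy data (hypothetical): 19 tightly clustered "bias-aligned" embeddings
# and one distant "bias-conflicting" embedding, i.e. a 5% conflict ratio.
embeddings = [[0.01 * i, 0.0] for i in range(19)] + [[5.0, 5.0]]

scores = gaussian_kde_scores(embeddings, bandwidth=0.5)
weights = inverse_density_weights(scores)

print("weight of a clustered sample:", round(weights[0], 4))
print("weight of the isolated sample:", round(weights[-1], 4))
```

In this toy run the isolated sample receives the largest weight, which is the qualitative behavior a density-based reweighting scheme relies on; how such weights enter a particular distillation objective depends on the DD method and is not shown here.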

Paper Structure

This paper contains 35 sections, 6 equations, 18 figures, and 8 tables.

Figures (18)

  • Figure 1: Workflow of our method, which uses a supervised contrastive model and kernel density estimation to mitigate bias in the dataset distillation process.
  • Figure 2: The two leftmost bars show model performance on the full dataset without distillation. For DSA/DM/MTT, the blue bar shows model performance on the unbiased dataset, and the red bar shows the performance of the corresponding dataset distillation method on the biased dataset with 5% bias-conflicting samples. Distillation performance is measured at IPC 10.
  • Figure 3: Ablation study on kernel variance and temperature on CMNIST with 5% bias-conflicting samples and IPC 10.
  • Figure 4: Synthetic images from vanilla DM (left) vs. ours (right), distilled from CMNIST with 5% bias-conflicting samples. The images synthesized by vanilla DM are dominated by the bias feature, while ours include a rich set of features from both bias-aligned and bias-conflicting samples.
  • Figure 5: KDE applied to a normal distribution.
  • ...and 13 more figures
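
To make the idea behind the Figure 5 caption concrete, the following minimal sketch applies a one-dimensional Gaussian KDE to samples drawn from a standard normal distribution and compares the estimate with the true density. The sample size, random seed, and fixed bandwidth of 0.3 are arbitrary assumptions for illustration and are not taken from the paper.

```python
import math
import random


def kde_1d(samples, x, bandwidth):
    """Gaussian kernel density estimate at point x from 1-D samples."""
    coef = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return coef * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def standard_normal_pdf(x):
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]
bandwidth = 0.3  # assumed value; larger values smooth the estimate more

for x in (-2.0, 0.0, 2.0):
    estimate = kde_1d(samples, x, bandwidth)
    print(f"x={x:+.1f}  KDE={estimate:.3f}  true={standard_normal_pdf(x):.3f}")
```

The estimate is highest near the mode and falls off in the tails, so samples near the mode of a distribution get high density scores while rare samples get low ones.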