Exploring the Impact of Dataset Bias on Dataset Distillation

Yao Lu, Jianyang Gu, Xuguang Chen, Saeed Vahidian, Qi Xuan

TL;DR

Experiments demonstrate that biases present in the original dataset significantly impact the performance of the synthetic dataset in most cases, which highlights the necessity of identifying and mitigating biases in the original datasets during dataset distillation (DD).

Abstract

Dataset Distillation (DD) is a promising technique for synthesizing a smaller dataset that preserves the essential information of the original dataset. The synthetic dataset can substitute for the original large-scale one and reduce the training workload. However, current DD methods typically assume that the dataset is unbiased, overlooking potential bias within the dataset itself. To fill this gap, we systematically investigate the influence of dataset bias on DD. To the best of our knowledge, this is the first such exploration in the DD domain. Because no suitable biased datasets exist for DD, we first construct two, CMNIST-DD and CCIFAR10-DD, as a foundation for the subsequent analysis. We then apply existing DD methods to generate synthetic datasets from CMNIST-DD and CCIFAR10-DD and evaluate their performance following the standard process. Experiments demonstrate that biases present in the original dataset significantly impact the performance of the synthetic dataset in most cases, which highlights the necessity of identifying and mitigating biases in the original datasets during DD. Finally, we reformulate DD within the context of a biased dataset. Our code and the biased datasets are available at https://github.com/yaolu-zjut/Biased-DD.

Paper Structure (11 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Visualizations of bias-conflicting and bias-aligned samples. Panels (a) and (f) show bias-conflicting samples in CMNIST-DD and CCIFAR10-DD, respectively. Panels (b)-(e) and (g)-(j) show bias-aligned samples at various severities in CMNIST-DD and CCIFAR10-DD, respectively. Severity increases from top to bottom. For CCIFAR10-DD, we apply 10 types of corruption to the 10 CIFAR10 categories, one per category: “snow” for “airplane”, “frost” for “automobile”, “fog” for “bird”, “brightness” for “cat”, “contrast” for “deer”, “spatter” for “dog”, “elastic” for “frog”, “JPEG” for “horse”, “pixelate” for “ship” and “saturate” for “truck”. Best viewed in color.
  • Figure 2: Visualizations of synthetic datasets generated by various DD methods on CMNIST-DD. All experiments are conducted at a severity level of 4.
  • Figure 3: Visualizations of synthetic datasets generated by various DD methods on CCIFAR10-DD. All experiments are conducted at a severity level of 4.
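
The class-to-corruption pairing listed in the Figure 1 caption can be written down as a small lookup table. The sketch below is illustrative only and is not taken from the authors' repository. The function name `assign_corruption` and the `conflict_ratio` parameter are hypothetical. It also assumes that a bias-conflicting sample receives a corruption drawn uniformly from the other nine classes, which the text above does not specify. Severity levels are not modeled here.

```python
import random

# Class-to-corruption pairing as listed in the Figure 1 caption.
CLASS_TO_CORRUPTION = {
    "airplane": "snow",
    "automobile": "frost",
    "bird": "fog",
    "cat": "brightness",
    "deer": "contrast",
    "dog": "spatter",
    "frog": "elastic",
    "horse": "JPEG",
    "ship": "pixelate",
    "truck": "saturate",
}


def assign_corruption(label, conflict_ratio, rng):
    """Pick a corruption name for one sample (hypothetical helper).

    Returns (corruption, is_bias_aligned).

    Assumption: with probability `conflict_ratio` the sample is
    bias-conflicting and gets a corruption sampled uniformly from the
    other classes' corruptions; otherwise it is bias-aligned and gets
    the corruption paired with its own class.
    """
    aligned = CLASS_TO_CORRUPTION[label]
    if rng.random() >= conflict_ratio:
        return aligned, True
    others = [c for c in CLASS_TO_CORRUPTION.values() if c != aligned]
    return rng.choice(others), False


# Usage: plan corruptions for 1000 "cat" samples with an assumed
# 5% bias-conflicting ratio (the ratio is illustrative, not from the paper).
rng = random.Random(0)
plan = [assign_corruption("cat", 0.05, rng) for _ in range(1000)]
num_aligned = sum(1 for _, is_aligned in plan if is_aligned)
```

In a split built this way, most samples of a class share one corruption, so a model (or a distillation method) can latch onto the corruption instead of the object. The small bias-conflicting fraction is what breaks that shortcut.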