Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?

Zeliang Zhang, Xin Liang, Mingqian Feng, Susan Liang, Chenliang Xu

TL;DR

This work investigates the effects of generated data on image classification tasks, with a specific focus on bias, and develops a practical simulation environment built around a self-consuming loop in which the generative model and the classification model are trained synergistically.

Abstract

As the demand for high-quality training data escalates, researchers have increasingly turned to generative models to create synthetic data, addressing data scarcity and enabling continuous model improvement. However, reliance on self-generated data raises a critical question: will this practice amplify bias in future models? While most research has focused on overall performance, the impact on model bias, particularly subgroup bias, remains underexplored. In this work, we investigate the effects of generated data on image classification tasks, with a specific focus on bias. We develop a practical simulation environment built around a self-consuming loop in which the generative model and the classification model are trained synergistically. We conduct hundreds of experiments on the Colorized MNIST, CIFAR-20/100, and Hard ImageNet datasets to reveal how fairness metrics change across generations. In addition, we provide a conjecture to explain the bias dynamics when training models on continuously augmented datasets across generations. Our findings contribute to the ongoing debate on the implications of synthetic data for fairness in real-world applications.
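To make the idea of a self-consuming loop concrete, the following is a minimal toy sketch, not the paper's simulation environment. It trains no real generator or classifier: the "generator" is just the empirical subgroup frequencies of the current training set, the filter is a hypothetical per-subgroup keep probability standing in for expert-guided filtering, and the tracked quantity is the gap in subgroup shares rather than any fairness metric from the paper. All function names and parameters are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_generator(dataset):
    """Toy 'generator': empirical subgroup frequencies of the training set."""
    counts = Counter(dataset)
    total = len(dataset)
    return {group: n / total for group, n in counts.items()}


def sample(generator, n, rng):
    """Draw n synthetic subgroup labels from the toy generator."""
    groups = sorted(generator)
    weights = [generator[g] for g in groups]
    return rng.choices(groups, weights=weights, k=n)


def expert_filter(samples, keep_prob, rng):
    """Hypothetical filter: keep each sample with a per-subgroup probability."""
    return [s for s in samples if rng.random() < keep_prob[s]]


def subgroup_gap(dataset):
    """Absolute difference between majority and minority shares (a crude proxy)."""
    counts = Counter(dataset)
    total = len(dataset)
    return abs(counts["majority"] / total - counts["minority"] / total)


def self_consuming_loop(real_data, generations, n_generated, keep_prob, seed=0):
    """Refit the generator each generation and stack filtered samples onto the data."""
    rng = random.Random(seed)
    dataset = list(real_data)
    history = [subgroup_gap(dataset)]
    for _ in range(generations):
        generator = fit_generator(dataset)               # refit on the current data
        synthetic = sample(generator, n_generated, rng)  # generate new samples
        dataset += expert_filter(synthetic, keep_prob, rng)  # data stacking
        history.append(subgroup_gap(dataset))
    return dataset, history


if __name__ == "__main__":
    real = ["majority"] * 60 + ["minority"] * 40
    # Assumed filter that discards minority samples slightly more often.
    _, gaps = self_consuming_loop(
        real, generations=10, n_generated=100,
        keep_prob={"majority": 0.9, "minority": 0.8},
    )
    print([round(g, 3) for g in gaps])
```

Even in this stripped-down setting, the per-generation gap depends on both sampling noise and the filter's subgroup-specific behavior, which is the kind of dynamic the full simulation studies with actual generative and classification models.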

Paper Structure

This paper contains 16 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Generative models can be leveraged to generate additional data that augments the training set and thereby helps train downstream models.
  • Figure 2: We continuously leverage new generators to produce additional images that enhance the training process, employing data stacking and expert-guided filtering to maintain high quality. We highlight the trajectory of the self-consuming loop in red.
  • Figure 3: Results on the models trained on the MNIST dataset with unbiased initialization.
  • Figure 4: Results on the models trained on the MNIST dataset with biased initialization.
  • Figure 5: Results on the models trained from scratch on the CIFAR-20/100 dataset.
  • ...and 3 more figures