SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems

Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh

TL;DR

SYNAuG tackles data imbalance by injecting class-conditioned synthetic samples generated with diffusion models to uniformize training distributions, followed by training on the augmented data and a final last-layer fine-tuning step. The approach acknowledges a domain gap between synthetic and real data and mitigates it through data augmentation and domain Mixup. Empirically, SYNAuG improves performance on long-tailed recognition, fairness, and spurious-correlation robustness, often surpassing task-specific baselines while relying on a few real samples. This data-centric strategy highlights the practical potential of synthetic data for real-world imbalance challenges, while underscoring the need for further domain-gap reduction and controllability of synthetic generation.
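The domain Mixup mentioned above can be illustrated with the standard Mixup interpolation applied to one real and one synthetic sample. This is only a minimal sketch: the summary does not say how SYNAuG pairs samples or sets the mixing coefficient, so the function name `mixup_real_synthetic`, the `alpha` default, and the flat-list data layout are all assumptions made for illustration.

```python
import random

def mixup_real_synthetic(x_real, y_real, x_syn, y_syn, alpha=0.2, rng=random):
    """Blend one real and one synthetic sample with standard Mixup.

    x_* are flat feature lists, y_* are one-hot label lists.
    alpha is a hypothetical default, not a value taken from the paper.
    """
    # Mixing coefficient drawn from Beta(alpha, alpha), as in standard Mixup.
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of inputs and of labels with the same coefficient.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_real, x_syn)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_real, y_syn)]
    return x_mix, y_mix, lam

# Toy usage: a real sample of class 0 mixed with a synthetic sample of class 1.
x_mix, y_mix, lam = mixup_real_synthetic([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
print(x_mix, y_mix, lam)
```

Because the same coefficient weights both inputs and labels, the mixed label remains a valid probability vector, and the mixed input lies on the line segment between the real and the synthetic sample.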

Abstract

Data imbalance in training data often leads to biased predictions from trained models, which in turn causes ethical and social issues. A straightforward solution is to carefully curate training data, but given the enormous amount of data that modern neural networks are trained on, this is prohibitively labor-intensive and thus impractical. Inspired by recent developments in generative models, this paper explores the potential of synthetic data to address the data imbalance problem. Specifically, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data. Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples achieves impressive performance on diverse tasks with different data imbalance issues, surpassing existing task-specific methods designed for the same purpose.

Paper Structure

This paper contains 16 sections, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the SYNAuG process. Given imbalanced real-world data with class labels, we first uniformize the imbalanced real data distribution by generating synthetic samples conditioned on the class label. Second, we train a model on the uniformized training data. Finally, we fine-tune the last layer on uniformly subsampled real-world data.
  • Figure 2: Replacement test. To investigate how model performance changes when original and synthetic data are used together, we replace original data with synthetic data in two ways: (a) class-wise and (b) at the same ratio of instances across all classes. We use CIFAR100, which has 100 classes with 500 samples per class.
  • Figure 3: Domain gap between real and synthetic data. We test the domain gap empirically with (a) binary domain classification and (b) feature visualization. For classification, we use 2.5k samples from each of the real and synthetic domains and train only one fully-connected layer on features extracted from a pre-trained model. For visualization, the features are extracted from a model pre-trained on CIFAR100. C1 and C2 denote different classes.
  • Figure 4: Ablation study on sample quality. (Top) Quality of the generated samples for different numbers of generation steps. (Bottom) Long-tailed recognition performance (%) for synthetic data generated with different numbers of steps, which affects sample quality. We use ImageNet100-LT with ResNet50.
  • Figure 5: Influence of class and group imbalance on the classifier. The 2D data are sampled from normal distributions with four different means and the same covariance. We simulate 4 different experiments with latent group imbalance (sensitive attributes) by adjusting the number of data points in each group. We train classifiers for the classes and visualize the learned classifiers (bold black lines). The more vertically aligned a classifier is, the fairer it is. The classifier trained under class imbalance is more unfair than the one trained under group imbalance.
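
The data-side steps in the Figure 1 caption (uniformizing with synthetic samples, then uniformly subsampling real data for last-layer fine-tuning) can be sketched as simple bookkeeping over class labels. This is a hedged illustration, not the authors' code: the helper names `synthetic_counts` and `uniform_subsample` are hypothetical, and using the largest class size as the uniformization target is an assumption, since the caption only says the distribution is made uniform.

```python
import random
from collections import Counter, defaultdict

def synthetic_counts(labels):
    """Number of synthetic samples each class needs to reach the largest class.

    Assumption: the uniform target is the size of the most frequent class.
    """
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

def uniform_subsample(samples, labels, per_class, seed=0):
    """Pick the same number of real samples from every class.

    The count is capped by the rarest class so that all classes stay equal.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    k = min(per_class, min(len(items) for items in by_class.values()))
    subset = []
    for label, items in by_class.items():
        # Sample without replacement within each class.
        subset.extend((sample, label) for sample in rng.sample(items, k))
    return subset

# Toy long-tailed label set: 5 "cat", 2 "dog", 1 "bird".
labels = ["cat"] * 5 + ["dog"] * 2 + ["bird"]
samples = list(range(len(labels)))
print(synthetic_counts(labels))  # {'cat': 0, 'dog': 3, 'bird': 4}
print(Counter(label for _, label in uniform_subsample(samples, labels, per_class=2)))
```

In the toy example the rarest class has a single real sample, so the fine-tuning subset ends up with one sample per class even though two were requested.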