Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models

Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman

TL;DR

The paper asks whether synthetically augmenting the entire dataset is necessary for strong in-distribution generalization. It develops a theoretical framework showing that targeting slow-learnable features and augmenting them with faithful diffusion-generated images generalizes better than upsampling or augmenting the entire dataset. Empirically, the method improves accuracy across ResNet, ViT, ConvNeXt, and Swin architectures on CIFAR-10/100 and TinyImageNet; applied with SGD, it outperforms SAM on CIFAR-100 and TinyImageNet. The approach is efficient, augmenting only 30%-40% of the data, and it can be stacked with existing weak and strong augmentation strategies. It highlights the importance of isolating and reinforcing slow-learnable features without amplifying noise in the synthetic data.

Abstract

Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting the part of the data that is not learned early in training with faithful images (containing the same features but different noise) outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts generalization by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100 and TinyImageNet, with various optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.
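The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say how "not learned early in training" is measured, so the sketch assumes a per-example loss recorded at an early checkpoint and treats the highest-loss fraction as slow-learnable. The names `select_slow_learnable`, `augmentation_plan`, `early_losses`, and `slow_fraction` are hypothetical, and the diffusion-based generation itself is left out.

```python
def select_slow_learnable(early_losses, slow_fraction=0.3):
    """Return sorted indices of the examples with the highest early-training loss.

    Assumption: high loss at an early checkpoint is used as a proxy for
    "not learned early in training"; the paper may use a different criterion.
    """
    n_slow = round(slow_fraction * len(early_losses))
    ranked = sorted(range(len(early_losses)),
                    key=lambda i: early_losses[i], reverse=True)
    return sorted(ranked[:n_slow])


def augmentation_plan(slow_indices, k):
    """Map each slow-learnable index to the number of synthetic images to generate.

    Assumption: an amplification factor k means each selected example appears
    k times in total (1 real copy plus k - 1 synthetic counterparts).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return {i: k - 1 for i in slow_indices}


# Toy usage with made-up early-checkpoint losses for 10 examples.
losses = [0.1, 2.0, 0.3, 1.5, 0.2, 0.05, 0.9, 0.4, 0.6, 0.01]
slow = select_slow_learnable(losses, slow_fraction=0.3)  # [1, 3, 6]
plan = augmentation_plan(slow, k=2)                      # {1: 1, 3: 1, 6: 1}
```

Only the selected indices would be passed to the generator, so the fast-learnable remainder of the dataset is left untouched.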

Paper Structure

This paper contains 22 sections, 13 theorems, 50 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.1

With controlled logit terms $l_i^{(t)} = \text{sigmoid}(-y_i f({\bm{x}}_i; {\bm{W}}^{(t)}))$, large data size $N$, small learning rate $\eta$, and small SAM perturbation parameter $\rho$ (see the proofs in the appendix), SAM and GD updates from the same parameters have the following property early in training [...] A special case of this theorem is that with the same initializations ${\bm{W}}^{(0)} \sim \mathcal{\cdots}$ [...]
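To make the objects in the theorem concrete, the sketch below computes the logit terms $l_i = \text{sigmoid}(-y_i f({\bm{x}}_i; {\bm{W}}))$ and takes one GD step and one SAM step from the same parameters. It rests on simplifying assumptions that are not from the paper: $f$ is a linear model rather than the two-layer CNN, the loss is the mean logistic loss $\log(1 + \exp(-y_i f))$ (whose gradient with respect to $f$ is $-y_i l_i$), and SAM uses the standard normalized-gradient perturbation of radius $\rho$. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logit_terms(w, xs, ys):
    """l_i = sigmoid(-y_i * f(x_i; w)), with f linear here (an assumption)."""
    return [sigmoid(-y * dot(w, x)) for x, y in zip(xs, ys)]


def gradient(w, xs, ys):
    """Gradient of the mean logistic loss: -(1/N) * sum_i l_i * y_i * x_i."""
    n = len(xs)
    ls = logit_terms(w, xs, ys)
    return [-sum(l * y * x[j] for l, y, x in zip(ls, ys, xs)) / n
            for j in range(len(w))]


def gd_step(w, xs, ys, eta):
    g = gradient(w, xs, ys)
    return [wj - eta * gj for wj, gj in zip(w, g)]


def sam_step(w, xs, ys, eta, rho):
    """Standard SAM: evaluate the gradient at w + rho * g / ||g||, then step from w."""
    g = gradient(w, xs, ys)
    norm = math.sqrt(sum(gj * gj for gj in g)) or 1.0  # guard against a zero gradient
    w_adv = [wj + rho * gj / norm for wj, gj in zip(w, g)]
    g_adv = gradient(w_adv, xs, ys)
    return [wj - eta * gj for wj, gj in zip(w, g_adv)]


# Toy data: three 2-D examples with labels in {+1, -1}.
xs = [[1.0, 0.5], [-1.0, 0.2], [0.8, -0.3]]
ys = [1, -1, 1]
w0 = [0.0, 0.0]

w_gd = gd_step(w0, xs, ys, eta=0.1)
w_sam = sam_step(w0, xs, ys, eta=0.1, rho=0.05)
```

At zero weights every logit term equals 0.5, and setting `rho=0` makes the SAM step coincide with the GD step, which is a quick sanity check on the implementation.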

Figures (7)

  • Figure 1: Examples of slow- and fast-learnable images and our faithful synthetic images corresponding to slow-learnable examples generated for CIFAR-10. Our synthetic data preserves the features in slow-learnable images but replaces the noise. This amplifies slow-learnable features without magnifying noise. This is difficult to achieve with standard augmentations like random cropping or flipping, highlighting the value of generative augmentation. Additional images are given in the full qualitative results figure.
  • Figure 2: Test classification error of ResNet18 on CIFAR-10, CIFAR-100, and TinyImageNet. For Upsample, we use a factor of $k=2$, as higher $k$ harms the performance. In contrast, for our method (Ours), we use $k=5, 5, 4$ for CIFAR-10, CIFAR-100, and TinyImageNet, respectively. Our method improves both SGD and SAM. Notably, it enables SGD to outperform SAM on CIFAR-100 and TinyImageNet.
  • Figure 3: Test classification error of VGG19, DenseNet121, and ViT-S on CIFAR-10. For Upsample, we use a factor of $k=2$ (as higher $k$ hurts the performance), while for Ours, we use $k=5$.
  • Figure 4: (left & middle) Comparison between different synthetic image augmentation strategies when training ResNet18 on CIFAR-10 and CIFAR-100. For Syn-rand and Ours, we use $k=2$, resulting in only 30% and 40% additional examples compared to 100% for Syn-all. (right) Our method with $k = 5$ can be stacked with TrivialAugment (TA) to further boost the performance when training ResNet18 on CIFAR-10, achieving (to our knowledge) SOTA test classification error.
  • Figure 5: Training ResNet18 on CIFAR-10. (left) The effect of amplification factor $k$ on test error for upsampling vs. generation. Red points indicate the optimal choice of $k$. $k>2$ hurts upsampling but boosts generation. (middle) Generating synthetic CIFAR-10 images from real images outperforms starting from random noise. (right) Effect of the number of denoising steps on the performance with $k = 2$.
  • ...and 2 more figures
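
The dataset-size figures quoted in the captions can be reproduced with simple arithmetic. The sketch assumes that an amplification factor $k$ adds $k - 1$ synthetic images per selected example, so the number of extra examples, as a fraction of the original dataset, is the selected fraction times $(k - 1)$. This reading is consistent with the Figure 4 caption (30% or 40% selected at $k=2$ gives 30% or 40% additional examples, and selecting everything gives 100%), but it is an inference rather than a statement from the paper. The function name `extra_fraction` is hypothetical.

```python
def extra_fraction(selected_fraction, k):
    """Additional examples as a fraction of the original dataset size.

    Assumption: each selected example receives k - 1 synthetic counterparts.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return selected_fraction * (k - 1)


targeted_30 = extra_fraction(0.3, k=2)  # targeted augmentation, 30% selected
targeted_40 = extra_fraction(0.4, k=2)  # targeted augmentation, 40% selected
syn_all = extra_fraction(1.0, k=2)      # augmenting every example once
```

Under the same assumption, larger $k$ scales the added data linearly, which is why the selected fraction and $k$ together determine the generation cost.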

Theorems & Definitions (22)

  • Definition 3.1: Data distribution
  • Definition 3.2: Two-layer CNN
  • Theorem 4.1
  • Theorem 4.2: Comparison of feature & noise learning
  • Theorem 4.3: Variance of mini-batch gradients
  • Corollary 4.4
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • ...and 12 more