Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data

Leonhard Hennicke, Christian Medeiros Adriano, Holger Giese, Jan Mathias Koehler, Lukas Schott

TL;DR

The paper addresses the large accuracy gap (~$23$pp) between models trained on synthetic data from Stable Diffusion and real data for ImageNet-100, within the context of data-free knowledge distillation. It employs a layer-wise transfer learning framework, pretraining most layers on synthetic data and fine-tuning the final layers with real data, and conducts extensive ablations on normalization, augmentation, texture cues, and prompt optimization. The key finding is that the drop is predominantly caused by the final layers, with late-layer pretraining on real data yielding substantial gains (e.g., up to $18.4$pp when including layers 15–17); synthetic pretraining of the earlier layers combined with limited real-data fine-tuning improves data efficiency (e.g., using $1/8$ of real data drops accuracy by only $4.2$pp). While certain techniques like unCLIP-based prompt optimization can reduce the gap (≈$26.3$pp improvement over SD v2.1 baselines), none fully closes the gap, highlighting a practical pathway to leverage synthetic data in scenarios with scarce labeled real data and guiding future work on targeted feature analysis and adaptive data strategies.

Abstract

Generative foundation models like Stable Diffusion comprise a diverse spectrum of knowledge in computer vision with the potential for transfer learning, e.g., via generating data to train student models for downstream tasks. This could circumvent the necessity of collecting labeled real-world data, thereby presenting a form of data-free knowledge distillation. However, the resultant student models show a significant drop in accuracy compared to models trained on real data. We investigate possible causes for this drop and focus on the role of the different layers of the student model. By training these layers using either real or synthetic data, we reveal that the drop mainly stems from the model's final layers. Further, we briefly investigate other factors, such as differences in data-normalization between synthetic and real, the impact of data augmentations, texture vs. shape learning, and assuming oracle prompts. While we find that some of those factors can have an impact, they are not sufficient to close the gap towards real data. Building upon our insights that mainly later layers are responsible for the drop, we investigate the data-efficiency of fine-tuning a synthetically trained model with real data applied to only those last layers. Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy. Our findings contribute to the understanding of the gap between synthetic and real data and indicate solutions to mitigate the scarcity of labeled real data.
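
The layer-wise setup described in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Layer` class and the `transfer_first_n` helper are hypothetical names introduced here, and no weights or training are modeled. The sketch only tracks which layers are kept frozen from the synthetically trained model and which are trained on real data. The count of 18 layers follows the Figure 3 caption (16 frozen layers plus the last two).

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Layer:
    """Toy stand-in for one network layer (hypothetical, for illustration only)."""
    name: str
    trained_on: str          # "synthetic" or "real"
    trainable: bool = True


def transfer_first_n(pretrained: List[Layer], n: int, finetune_data: str) -> List[Layer]:
    """Keep the first n layers of the pretrained model frozen and mark the
    remaining layers as trained on finetune_data."""
    if not 0 <= n <= len(pretrained):
        raise ValueError("n must be between 0 and the number of layers")
    frozen = [replace(layer, trainable=False) for layer in pretrained[:n]]
    retrained = [replace(layer, trained_on=finetune_data, trainable=True)
                 for layer in pretrained[n:]]
    return frozen + retrained


# Assumption: an 18-layer student, all layers first trained on synthetic data.
synthetic_model = [Layer(f"layer{i}", "synthetic") for i in range(18)]

# Freeze the first 16 layers; only the last two see real data.
student = transfer_first_n(synthetic_model, n=16, finetune_data="real")
for layer in student[-3:]:
    print(layer.name, layer.trained_on, "trainable" if layer.trainable else "frozen")
```

In an actual deep learning framework the same idea would typically be expressed by disabling gradient updates for the transferred layers; the bookkeeping above is only meant to make the split explicit.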

Paper Structure (24 sections, 1 equation, 4 figures, 4 tables)


Figures (4)

  • Figure 1: This figure illustrates our transfer learning setup for N = 2. N indicates the number of consecutive layers that are transferred from a model that was trained on the respective first dataset, starting from the first layer.
  • Figure 2: Results of the layer importance experiments using transfer learning from real and synthetic data respectively for N = 1 to N = 17. We also plot the results for the baseline models trained only on synthetic or only on real data; these represent the two extremes of this experiment setup and therefore provide an approximate lower and upper bound.
  • Figure 3: Data reduction experiments. The first 16 layers are trained on synthetic data and frozen. The last two layers are fine-tuned with varying numbers of real data samples. The fitted curves are derived from the plotted top-1 accuracy on the respective scales via least-squares fitting of a 1st-degree polynomial ($f(x) = -2.23 \log(x) + 85.61$).
  • Figure 4: Training on a reduced real dataset. Top-1 accuracy when reducing the amount of real training data with (blue, same as Figure 3) and without (orange) synthetic data transfer learning.
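
The curve fit mentioned in the Figure 3 caption, a 1st-degree polynomial in $\log(x)$, can be reproduced in a few lines. The following is a minimal sketch, not the authors' procedure: the helper `fit_log_linear` is a hypothetical name, the natural logarithm is an assumption (the caption does not state the base), and the sample points are generated from the caption's own formula purely for illustration. They are not measured accuracies.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * log(x) + b (natural log assumed)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    ts = [math.log(x) for x in xs]            # linearize: t = log(x)
    t_mean = sum(ts) / len(ts)
    y_mean = sum(ys) / len(ys)
    sxx = sum((t - t_mean) ** 2 for t in ts)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    a = sxy / sxx                             # slope in log(x)
    b = y_mean - a * t_mean                   # intercept
    return a, b


# Illustrative x values (hypothetical); y values come from the caption's curve.
xs = [1, 2, 4, 8, 16]
ys = [-2.23 * math.log(x) + 85.61 for x in xs]

a, b = fit_log_linear(xs, ys)
print(f"f(x) = {a:.2f} log(x) + {b:.2f}")
```

Fitting a straight line against $\log(x)$ rather than $x$ is what makes a 1st-degree polynomial sufficient here: a constant accuracy change per multiplicative change in $x$ becomes a constant slope.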