Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, Stefan Roth

TL;DR

We study whether diffusion-generated synthetic images can ease the data-labeling bottleneck by benchmarking three classes of synthetic clones (supervised, self-supervised, and multi-modal) against strong real-image baselines. The benchmark covers calibration, OOD detection, adversarial robustness, and common corruptions, with ablations on prompts and data mixing. Self-supervised and multi-modal synthetic clones match or exceed real-data baselines on several robustness metrics (e.g., calibration and bias measures) but are generally more vulnerable to adversarial attacks and common corruptions, while supervised synthetic clones lag behind on multiple metrics. Mixing real and synthetic data and using richer prompts (captions or CLIP templates) improve robustness, suggesting that practical deployments should combine data sources and design generation prompts carefully.

Abstract

A long-standing challenge in developing machine learning approaches has been the lack of high-quality labeled data. Recently, models trained with purely synthetic data, here termed synthetic clones, generated using large-scale pre-trained diffusion models have shown promising results in overcoming this annotation bottleneck. As these synthetic clone models progress, they are likely to be deployed in challenging real-world settings, yet their suitability remains understudied. Our work addresses this gap by providing the first benchmark for three classes of synthetic clone models, namely supervised, self-supervised, and multi-modal ones, across a range of robustness measures. We show that existing synthetic self-supervised and multi-modal clones are comparable to or outperform state-of-the-art real-image baselines on a range of robustness metrics, such as shape bias, background bias, and calibration. However, we also find that synthetic clones are much more susceptible to adversarial and real-world noise than models trained with real data. To address this, we find that combining real and synthetic data further increases robustness, and that the choice of prompt used for generating synthetic images plays an important part in the robustness of synthetic clones.

Paper Structure (15 sections, 2 figures, 9 tables)

This paper contains 15 sections, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Setups for training different classes of models using synthetic images. Supervised learning (bottom) uses the ground-truth label for conditionally generating a synthetic image, while self-supervised (top left) and multi-modal methods (top right) make use of a concept bank along with a large language model (LLM) for prompt generation. Please see text for more details.
  • Figure 2: Test error vs. ECE for ID and OOD datasets. We report the ECE and test error for both the ID dataset (ImageNet) and the OOD datasets (ImageNet-{R,A}). Filled markers indicate real models, empty markers indicate synthetic clones.