How far can we go with ImageNet for Text-to-Image generation?

L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton

TL;DR

The paper questions the necessity of billion-scale datasets for text-to-image generation and shows that a 400M-parameter diffusion model trained exclusively on ImageNet, with carefully designed long captions and image augmentations, can achieve competitive image quality and strong benchmark performance. By adapting DiT-I and CAD-I architectures and introducing text- and image-augmentation strategies (notably long captions and CutMix/Crop), the authors build a scalable, reproducible training recipe that delivers high compositionality and outperforms several data-heavy baselines on GenEval and DPGBench. Scaling to higher resolution (512^2) further strengthens performance, with qualitative and quantitative results rivaling or surpassing models trained on much larger corpora. The work demonstrates that ImageNet can support general T2I capabilities and task-specific fine-tuning (e.g., aesthetics) within accessible compute, suggesting a path toward more sustainable and open research in text-to-image generation.

Abstract

Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a "bigger is better" paradigm that prioritizes data quantity over availability (closed vs. open source) and reproducibility (data decay vs. established collections). We challenge this established paradigm by demonstrating that one can match the capabilities of models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet-pretrained models can be fine-tuned on task-specific datasets (such as for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research, as ImageNet is widely available and the proposed standardized training setup requires only 500 H100 hours to train a text-to-image model.

Paper Structure

This paper contains 37 sections, 4 equations, 16 figures, 14 tables, 1 algorithm.

Figures (16)

  • Figure 1: Images generated by our 400M-parameter text-to-image model trained solely on ImageNet. Text prompts are taken from PartiPrompts (Yu et al., 2022).
  • Figure 2: Quantitative results on GenEval (left) and DPGBench (right). The size of each bubble represents the number of parameters. In both cases, we outperform models with $10\times$ the parameters that were trained on $1000\times$ the number of images.
  • Figure 3: Training dynamics showing FID and GenEval scores vs. training steps. TA + IA maintains better scores throughout training than TA only, demonstrating improved resistance to overfitting. Lower FID scores indicate better image quality. Higher GenEval scores indicate better compositional ability.
  • Figure 4: Qualitative comparison: Text Augmentation (TA; first and third columns) vs. Text + Image Augmentation (TA + IA; second and last columns) for four prompts (left and right blocks per row). Image augmentation improves text comprehension, compositionality, and overall image quality.
  • Figure 5: Comparison with SOTA models at $1024^2$ resolution. Each row shows the same prompt rendered by four different models: Ours, SDXL, PixArt-$\alpha$, and SD3-Medium. The prompt is taken from ImageReward (Xu et al., 2023). Additional comparisons are shown in Figure \ref{fig:visual_sota_extended}.
  • ...and 11 more figures