A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan

TL;DR

This work identifies a key bottleneck in text-to-image diffusion models: training captions derived from open-web alt-text are often misaligned with inference prompts and lack detail. It presents RECAP, a three-step recaptioning pipeline that fine-tunes an image-to-text captioner on a small human-caption set, uses it to relabel a large training corpus, and retrains a T2I model on the recaptioned data. Across automated metrics and human evaluations on MS-COCO and DrawBench, RECAP yields substantial gains in image fidelity and semantic alignment, reducing train-inference skew and improving sample efficiency. The results suggest high-quality synthetic captions are a powerful lever for improving cross-modal understanding in diffusion models and open avenues for domain-specific captioning and larger-scale training.

Abstract

Text-to-image diffusion models have achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web, and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g. FID 14.84 vs. the baseline of 17.87, and a 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment: e.g. semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44, and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
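
To put the abstract's metric pairs on a common scale, the short sketch below converts each one into a relative improvement over the baseline. The numbers are copied from the abstract; the helper name `relative_gain` and the direction flags are assumptions for illustration (FID is lower-is-better per the Figure 5 caption; treating counting errors as lower-is-better and the two accuracy-style scores as higher-is-better is inferred from the abstract presenting all of them as gains).

```python
# Metric pairs quoted in the abstract: (RECAP, baseline, lower_is_better).
# The lower_is_better flags are assumptions, see the note above.
METRICS = {
    "FID": (14.84, 17.87, True),
    "semantic object accuracy": (84.34, 78.90, False),
    "counting alignment errors": (1.32, 1.44, True),
    "positional alignment": (62.42, 57.60, False),
}


def relative_gain(recap, baseline, lower_is_better):
    """Return the improvement over the baseline, as a percentage of the baseline."""
    delta = (baseline - recap) if lower_is_better else (recap - baseline)
    return 100.0 * delta / baseline


if __name__ == "__main__":
    for name, (recap, baseline, lower) in METRICS.items():
        gain = relative_gain(recap, baseline, lower)
        print(f"{name}: {gain:.1f}% relative improvement over the baseline")
```

These relative figures are only a reading aid. They are not reported in the paper and are unrelated to the 64.3% human-evaluation result, which is measured differently.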

Paper Structure (27 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Examples of non-cherrypicked generations from the base Stable Diffusion model (left) and our model trained on a recaptioned dataset (right), on the same set of random seeds.
  • Figure 2: Schematic diagram of our method, RECAP. In steps (1a) and (1b) we fine-tune an image-to-text captioning model on a small set of detailed human captions. In step (2) we use this fine-tuned model to recaption the images in the training dataset of a text-to-image model. In step (3) we train an image generation model on the recaptioned dataset.
  • Figure 3: Examples of captions given by human raters, and the automatically generated caption from the non-fine-tuned PaLI model. Photos taken from LAION.
  • Figure 4: Examples of captions generated by the RECAP model conditioned on the short or long prefixes, the original PaLI model, and the original Alttext captions. Photos taken from LAION.
  • Figure 5: FID (left) and SOA (right) scores for various checkpoints of the Alttext model vs. RECAP models, comparing models trained on different sets of captions. Zero fine-tuning steps corresponds to vanilla Stable Diffusion 1.4. Lower FID is better. RECAP Short reaches a better FID, and reaches it faster, but shows no semantic improvement. RECAP Long achieves some FID improvement along with a significant semantic improvement, and RECAP Mix achieves both the FID and the semantic improvements.
  • ...and 8 more figures
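
The three steps in the Figure 2 caption can be sketched as a small driver function. This is a minimal sketch of the data flow only, not the authors' implementation: the names `recap`, `finetune_captioner` and `train_t2i`, and the placeholder `Image` type, are hypothetical, and the actual captioner fine-tuning and diffusion training are passed in as callables rather than implemented.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Image = bytes                     # placeholder type for raw image data (assumption)
Pair = Tuple[Image, str]          # one (image, caption) training example
Captioner = Callable[[Image], str]


def recap(
    human_pairs: Sequence[Pair],
    corpus: Iterable[Pair],
    finetune_captioner: Callable[[Sequence[Pair]], Captioner],
    train_t2i: Callable[[List[Pair]], object],
):
    """Run the three RECAP steps and return (trained model, recaptioned data)."""
    # Step 1: fine-tune an image-to-text captioner on a small set of
    # detailed human captions.
    captioner = finetune_captioner(human_pairs)

    # Step 2: relabel every image in the training corpus with the
    # fine-tuned captioner; the original alt-text is dropped.
    recaptioned = [(image, captioner(image)) for image, _alt_text in corpus]

    # Step 3: train the text-to-image model on the recaptioned pairs.
    model = train_t2i(recaptioned)
    return model, recaptioned


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any ML library.
    def toy_finetune(pairs):
        return lambda image: f"a detailed caption for a {len(image)}-byte image"

    def toy_train(pairs):
        return {"num_examples": len(pairs)}

    model, data = recap(
        human_pairs=[(b"x", "a tabby cat asleep on a red sofa")],
        corpus=[(b"abc", "IMG_001.jpg"), (b"de", "buy now")],
        finetune_captioner=toy_finetune,
        train_t2i=toy_train,
    )
    print(model)
    print([caption for _, caption in data])
```

Passing the two training routines in as callables keeps the sketch independent of any particular captioning or diffusion library. The short and long caption prefixes shown in Figure 4 would be an option of the captioner in step 2, which this sketch does not model.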