Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo

TL;DR

The paper tackles the problem that caption length is an unreliable proxy for the visual detail needed to train high-quality text-to-image (T2I) models. It introduces two metrics derived from scene graphs, image coverage rate (ICR) and average object detailness (AOD), and combines them into a caption detailness (CD) score. In experiments on COCO with ShareGPT4V captions, higher ICR and AOD correlate with better image-text alignment and reconstruction, and a model trained on only about 20% of the data, selected with these metrics, outperforms one trained on the full set. The results show that detail-aware caption selection outperforms length-based filtering, improves DPG and IIW performance, and supports practical data-efficient training for T2I models.

Abstract

Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics like caption length to represent the detailness of the captions in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and high-AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and length-based selection, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.
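
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `select_top_fraction` is hypothetical, and the per-caption scores are assumed to be precomputed, since this summary does not say how ICR and AOD are combined into a single score.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(samples: Sequence[T],
                        scores: Sequence[float],
                        fraction: float = 0.2) -> List[T]:
    """Keep the highest-scoring `fraction` of samples (hypothetical helper).

    `scores` is assumed to hold one precomputed detailness score per sample;
    how that score is derived from ICR and AOD is not specified here.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always keep at least one sample when the input is non-empty.
    k = max(1, int(len(samples) * fraction))
    # Rank indices by score, highest first, then keep the top k.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:k]]


captions = ["cap_a", "cap_b", "cap_c", "cap_d", "cap_e"]
detail_scores = [0.1, 0.9, 0.4, 0.7, 0.2]
subset = select_top_fraction(captions, detail_scores, fraction=0.4)
```

A length-based baseline would use the same routine with caption length as the score, which makes the two selection strategies easy to compare on equal data budgets.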

Paper Structure

This paper contains 22 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Caption length is not a perfect indicator of caption detailness. In this paper, we propose a new metric to more effectively estimate caption detailness. Using this metric to select T2I training data surpasses both full-data training and length-based selection methods on the dense prompt graph (DPG) benchmark.
  • Figure 2: Illustration of calculating average object detailness (AOD) and image coverage rate (ICR). The image caption is first parsed into a scene graph, from which we extract the semantic graph for each object. A segmentation model provides object masks. AOD is computed as the average number of triplets per object graph, while ICR is determined by the total area ratio of all objects.
  • Figure 3: An illustration of how scene graphs are sampled to produce captions with varying image coverage rates (ICRs). Starting from the scene graph of a detailed caption, sub-graphs with different ICRs are sampled and converted into captions.
  • Figure 4: Generation examples, using DPG prompts, from models trained on captions with different ICR and AOD ratios.
  • Figure 5: Text-to-image performance of Lumina-Next-T2I fine-tuned on captions with different ICR and AOD ratios.
  • ...and 3 more figures
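
The computation outlined in the Figure 2 caption can be sketched as below. This is only an illustration under stated assumptions: the function names and input formats are hypothetical, scene-graph parsing and segmentation are assumed to have been done upstream, and taking the union of masks (so overlapping pixels are counted once) is an assumption about what "total area ratio" means.

```python
from typing import Dict, List, Tuple

import numpy as np

Triplet = Tuple[str, str, str]  # (subject, relation, object-or-attribute)


def average_object_detailness(object_graphs: Dict[str, List[Triplet]]) -> float:
    """Average number of triplets per object graph (AOD, per the Figure 2 caption)."""
    if not object_graphs:
        return 0.0
    total_triplets = sum(len(triplets) for triplets in object_graphs.values())
    return total_triplets / len(object_graphs)


def image_coverage_rate(masks: List[np.ndarray], image_shape: Tuple[int, int]) -> float:
    """Fraction of image pixels covered by the mentioned objects' masks (ICR).

    Assumption: masks are boolean arrays of shape `image_shape`, and overlaps
    are counted once by taking their union.
    """
    covered = np.zeros(image_shape, dtype=bool)
    for mask in masks:
        covered |= mask.astype(bool)
    return float(covered.sum()) / covered.size


# Toy example: two objects described by three triplets in total.
graphs = {
    "dog": [("dog", "is", "brown"), ("dog", "on", "grass")],
    "ball": [("ball", "is", "red")],
}
aod = average_object_detailness(graphs)

# Two overlapping 2x2 masks on a 4x4 image: 4 + 4 - 2 shared pixels = 6 covered.
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[0:2, 0:2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[0:2, 1:3] = True
icr = image_coverage_rate([mask_a, mask_b], (4, 4))
```

In this toy case a long caption that lavishes words on one small object would score a low ICR, while a short caption naming every large region would score a high one, which is the mismatch with caption length that the paper highlights.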