Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

Xinran Wang; Muxi Diao; Yuanzhi Liu; Chunyu Wang; Kongming Liang; Zhanyu Ma; Jun Guo

TL;DR

The paper addresses the problem that caption length is an unreliable proxy for the visual detail needed to train high-quality text-to-image (T2I) models. It introduces two metrics derived from scene graphs, image coverage rate (ICR) and average object detailness (AOD), and combines them into a caption detailness (CD) score. In experiments on COCO with ShareGPT4V captions, higher ICR and AOD correlate with better image-text alignment and reconstruction, and a model trained on only about 20% of the data, selected with these metrics, outperforms one trained on the full set. The results show that detail-aware caption selection outperforms length-based filtering, improves DPG and IIW performance, and supports practical data-efficient training of T2I models.

Abstract

Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics such as caption length to represent the detailness of captions in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on captions with high ICR and AOD achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and length-based selection, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.
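
The selection idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: ICR is taken as the fraction of image objects that appear in the caption's scene graph, AOD as the mean number of attributes per caption object, and the combined score as their product. The names `SceneGraphCaption`, `caption_detailness`, and `select_top_fraction` are hypothetical and are not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SceneGraphCaption:
    """Hypothetical container for one image-caption pair."""
    image_id: str
    # Objects present in the image (assumed to come from a detector or annotations).
    image_objects: set[str]
    # Caption scene graph: object name -> attributes/relations mentioned for it.
    caption_attributes: dict[str, list[str]]


def image_coverage_rate(sample: SceneGraphCaption) -> float:
    """Assumed ICR: share of image objects that the caption mentions."""
    if not sample.image_objects:
        return 0.0
    covered = sample.image_objects & set(sample.caption_attributes)
    return len(covered) / len(sample.image_objects)


def average_object_detailness(sample: SceneGraphCaption) -> float:
    """Assumed AOD: mean number of attributes per object in the caption."""
    if not sample.caption_attributes:
        return 0.0
    total = sum(len(attrs) for attrs in sample.caption_attributes.values())
    return total / len(sample.caption_attributes)


def caption_detailness(sample: SceneGraphCaption) -> float:
    """Assumed combination of the two metrics: a simple product."""
    return image_coverage_rate(sample) * average_object_detailness(sample)


def select_top_fraction(
    samples: list[SceneGraphCaption], fraction: float = 0.2
) -> list[SceneGraphCaption]:
    """Keep the highest-scoring fraction of samples (at least one)."""
    if not samples:
        return []
    k = max(1, round(len(samples) * fraction))
    ranked = sorted(samples, key=caption_detailness, reverse=True)
    return ranked[:k]
```

Under these assumptions, a short caption that names most objects and describes each one can outrank a longer caption that dwells on a single object, which is the contrast with length-based filtering that the abstract draws.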
