Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model

Sheng Cheng, Maitreya Patel, Yezhou Yang

TL;DR

An analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but that precision has a more significant impact than recall. The paper then uses Large Vision Language Models to generate synthetic captions for training.

Abstract

Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential of synthetic data in text-to-image training.

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Compositional capabilities across various combinations of precision and recall on human-annotated captions. Positive sentences indicate the precision of the captions, while the number of submasks represents their comprehensiveness (recall).
  • Figure 2: One sample from the DAC dataset, used for analysis of human-annotated captions.
  • Figure 3: One sample from MSCOCO, used for generating synthetic captions by LVLM.
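
To make the precision and recall terminology concrete, here is a toy sketch in Python. It assumes a caption and an image can each be reduced to a set of short factual statements, which is a simplification chosen for illustration and not necessarily the definition used in the paper. The function name `caption_precision_recall` and the example facts are hypothetical.

```python
def caption_precision_recall(caption_facts, image_facts):
    """Toy set-based precision and recall for a single caption.

    Assumption (not from the paper): both arguments are collections of
    short statements, e.g. {"dog", "red ball"}.
      precision = share of caption statements that are true of the image
      recall    = share of true image statements that the caption mentions
    """
    caption_facts = set(caption_facts)
    image_facts = set(image_facts)
    correct = caption_facts & image_facts
    precision = len(correct) / len(caption_facts) if caption_facts else 0.0
    recall = len(correct) / len(image_facts) if image_facts else 0.0
    return precision, recall


# Hypothetical ground-truth content of one image.
image_facts = {"dog", "red ball", "grass", "fence"}

# A short caption with no errors: high precision, low recall.
accurate_but_short = {"dog", "grass"}

# A longer caption with one hallucinated object ("cat").
detailed_but_noisy = {"dog", "red ball", "grass", "cat"}

short_scores = caption_precision_recall(accurate_but_short, image_facts)
noisy_scores = caption_precision_recall(detailed_but_noisy, image_facts)
print(short_scores)
print(noisy_scores)
```

Under this reading, the short caption trades recall for precision and the longer one does the reverse, which is the kind of trade-off the analysis above varies when comparing training captions.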