Improving Compositional Text-to-image Generation with Large Vision-Language Models

Song Wen; Guian Fang; Renrui Zhang; Peng Gao; Hao Dong; Dimitris Metaxas

TL;DR

This work tackles the challenge of compositional text-to-image generation in diffusion models by introducing a plug-and-play framework that (i) uses large vision-language models (LVLMs) to evaluate image-text alignment across object count, attribute binding, spatial relations, and aesthetics, (ii) fine-tunes latent diffusion models with Reward Feedback Learning driven by LVLM-based accuracy, and (iii) employs LVLM-guided editing at inference using SAM and diffusion-based inpainting to iteratively correct misalignments. The approach yields improved alignment and image quality on compositional prompts, validated by both quantitative metrics and qualitative editing demonstrations. It provides a practical, end-to-end strategy to enhance text-image fidelity in complex prompts, with potential for broader adoption in compositional generation tasks.

Abstract

Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently encounter difficulties in generating high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Utilizing this assessment, we fine-tune the diffusion model to enhance its alignment capabilities. During the inference phase, an initial image is produced using the fine-tuned diffusion model. The LVLM is then employed to pinpoint areas of misalignment in the initial image, which are subsequently corrected using the image editing algorithm until no further misalignments are detected by the LVLM. The resultant image is consequently more closely aligned with the input text. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation, particularly with respect to object number, attribute binding, spatial relationships, and aesthetic quality.
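The inference procedure described in the abstract (generate, detect misalignments, edit, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `generate`, `find_misalignments`, and `edit` are hypothetical callables standing in for the fine-tuned diffusion model, the LVLM, and the image editing algorithm, and the `max_rounds` cap is an added safeguard that the source does not mention.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")


def generate_with_editing(
    prompt: str,
    generate: Callable[[str], Image],                       # hypothetical: fine-tuned diffusion model
    find_misalignments: Callable[[Image, str], List[str]],  # hypothetical: LVLM check
    edit: Callable[[Image, str], Image],                    # hypothetical: image editing step
    max_rounds: int = 5,                                    # assumption: cap to guarantee termination
) -> Image:
    """Produce an initial image, then edit it until the checker reports no issues."""
    image = generate(prompt)
    for _ in range(max_rounds):
        issues = find_misalignments(image, prompt)
        if not issues:
            break  # the checker found nothing left to correct
        for issue in issues:
            image = edit(image, issue)
    return image


# Toy stand-ins: an "image" is just the set of objects it contains.
def toy_generate(prompt: str) -> frozenset:
    return frozenset({"dog"})


def toy_find(image: frozenset, prompt: str) -> List[str]:
    return [obj for obj in ("dog", "cat") if obj not in image]


def toy_edit(image: frozenset, issue: str) -> frozenset:
    return image | {issue}


result = generate_with_editing("a dog and a cat", toy_generate, toy_find, toy_edit)
```

In the toy run, the checker reports the missing "cat" once, the edit step adds it, and the second check finds nothing, so the loop stops before reaching the cap.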

Paper Structure (15 sections, 7 equations, 6 figures, 1 algorithm)

Introduction
Related Work
Method
Preliminary
Overview
LVLM-based Evaluation
Model Fine-tuning
LVLM-guided Editing
Experiment
Implementation Detail
Experimental Results
Conclusion
Additional Related Work
Large Vision-Language Models (LVLMs).
Additional Visualization

Figures (6)

Figure 1: Illustrating Limitations in Compositional Text-to-Image Generation. (a) Object Number: The discrepancy between the quantity of objects in the image (e.g., cat and dog) and the input text is evident. (b) Attribute Binding: The attributes of objects depicted do not correspond with the input text; for instance, the cat’s color is black and white, contrasting with the specified black. (c) Spatial Relationship: The arrangement of objects does not conform to the input text, with the suitcase not situated to the right of the cow as described. (d) Aesthetic Quality: The representation of the object is distorted, deviating from conventional aesthetic standards.
Figure 2: Overview of the Proposed Methodology. Our methodology is structured around three core components: (1) LVLM-based Evaluation: Drawing inspiration from TIFA, we first use an LLM to formulate question-answer pairs grounded in the input text. The LVLM then answers the same questions given the image. Comparing the answers derived from the image with those derived from the text yields the answer accuracy, which serves as our evaluation metric. (2) Model Fine-tuning: The LVLM-based evaluation metric is incorporated as a weight within the diffusion loss function to fine-tune the diffusion model. The objective is to guide the diffusion model towards higher answer accuracy. (3) LVLM-guided Editing: In the inference phase, the LVLM is used to identify misalignments between image and text. Image-editing algorithms are then applied iteratively to rectify the image until no misalignment is detected.
Figure 3: The answers produced by Bard. The images are generated with the text "a black dog is standing on a beach".
Figure 4: The images generated by Stable Diffusion and the fine-tuned model.
Figure 5: Visualization of LVLM-guided Editing.
...and 1 more figures
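The evaluation and fine-tuning components in the Figure 2 caption can be illustrated with a small sketch. The caption says only that answer accuracy is computed by comparing answers and that it is used as a weight in the diffusion loss. The exact-match comparison after lowercasing, the plain multiplication, and the batch averaging below are therefore assumptions, as are the names `answer_accuracy`, `weighted_loss`, and `answer_fn` (a hypothetical stand-in for querying the LVLM about one image).

```python
from typing import Callable, Sequence, Tuple


def answer_accuracy(
    qa_pairs: Sequence[Tuple[str, str]],  # (question, answer derived from the text)
    answer_fn: Callable[[str], str],      # hypothetical: LVLM answer for one image
) -> float:
    """Fraction of questions where the image-based answer matches the text-based one."""
    if not qa_pairs:
        raise ValueError("at least one question-answer pair is required")
    correct = sum(
        1
        for question, expected in qa_pairs
        # Assumption: answers are compared by case-insensitive exact match.
        if answer_fn(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(qa_pairs)


def weighted_loss(per_sample_losses: Sequence[float], accuracies: Sequence[float]) -> float:
    """Assumed form: batch mean of each sample's loss scaled by its answer accuracy."""
    if len(per_sample_losses) != len(accuracies):
        raise ValueError("losses and accuracies must have the same length")
    if not per_sample_losses:
        raise ValueError("at least one sample is required")
    total = sum(acc * loss for acc, loss in zip(accuracies, per_sample_losses))
    return total / len(per_sample_losses)


# Toy example using the prompt from Figure 3; the answers are invented stand-ins.
qa = [
    ("Is there a dog?", "yes"),
    ("What color is the dog?", "black"),
    ("Where is the dog?", "beach"),
    ("Is the dog standing?", "yes"),
]
toy_answers = {
    "Is there a dog?": "Yes",
    "What color is the dog?": "brown",
    "Where is the dog?": "beach",
    "Is the dog standing?": "yes",
}
accuracy = answer_accuracy(qa, lambda q: toy_answers[q])
loss = weighted_loss([1.0, 2.0], [0.5, 1.0])
```

With this weighting, samples whose images answer more questions correctly contribute more to the training signal; other ways of folding the metric into the loss are equally compatible with the caption.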
