Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Xuexiang Niu, Jinping Tang, Lei Wang, Ge Zhu

TL;DR

This work addresses the challenge of aligning text prompts with diffusion-generated images when prompts demand specific object categories and quantities. It introduces a three-step methodology: constructing a 1,700-prompt compositional dataset, deriving a differentiable reward from an object detector based on category and quantity confidences, and fine-tuning a diffusion model by backpropagating reward gradients while balancing them against the original pretraining loss. The proposed CQ_Score reward, which combines Acc and Aqc, guides refinement toward semantically accurate multi-object scenes and achieves better alignment and image fidelity than strong baselines. The results are supported by both quantitative metrics (CLIP, BLIP, CQ, FID) and qualitative examples, and the work provides a dataset and metric for evaluating compositional generation in text-to-image models.
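
To make the idea of a category-and-quantity score concrete, the following is a minimal sketch only. This summary names Acc and Aqc but does not give their formulas, so the functions `category_confidence`, `quantity_confidence`, and `cq_score`, the 0.5 count threshold, and the equal weighting are all hypothetical choices made for illustration. The sketch also uses plain Python and is not differentiable, whereas the reward described above is.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One detection: (category label, detector confidence in [0, 1]).
Detection = Tuple[str, float]


def category_confidence(requested: Dict[str, int], detections: List[Detection]) -> float:
    """Hypothetical category term: mean, over requested categories, of the
    highest detector confidence for that category (0 if never detected)."""
    if not requested:
        raise ValueError("requested must name at least one category")
    best: Dict[str, float] = {}
    for label, conf in detections:
        best[label] = max(best.get(label, 0.0), conf)
    return sum(best.get(cat, 0.0) for cat in requested) / len(requested)


def quantity_confidence(
    requested: Dict[str, int], detections: List[Detection], threshold: float = 0.5
) -> float:
    """Hypothetical quantity term: mean, over requested categories, of
    min(detected, requested) / max(detected, requested)."""
    if not requested:
        raise ValueError("requested must name at least one category")
    # Count only detections at or above the (assumed) confidence threshold.
    counts = Counter(label for label, conf in detections if conf >= threshold)
    ratios = []
    for cat, want in requested.items():
        got = counts.get(cat, 0)
        ratios.append(min(got, want) / max(got, want, 1))
    return sum(ratios) / len(ratios)


def cq_score(
    requested: Dict[str, int], detections: List[Detection], weight: float = 0.5
) -> float:
    """Hypothetical combination: weighted average of the two terms."""
    acc = category_confidence(requested, detections)
    aqc = quantity_confidence(requested, detections)
    return weight * acc + (1.0 - weight) * aqc


if __name__ == "__main__":
    # Prompt: "Two sheep on the prairie"; the detector also reports a dog.
    requested = {"sheep": 2}
    detections = [("sheep", 0.9), ("sheep", 0.8), ("dog", 0.7)]
    print(cq_score(requested, detections))
```

In this sketch an image with too few or too many sheep lowers the quantity term, while a missing category drives both terms toward zero, which is the kind of behavior a category-and-quantity reward needs.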

Abstract

Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, because existing feedback lacks focus, especially regarding object type and quantity, these techniques struggle to match text and images accurately when prompts specify object categories and quantities. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, consisting of three stages. First, an object detector is applied to images generated by the diffusion model to obtain the object categories and quantities; the confidence of category and quantity is then derived from the detection results and the given prompts. Next, we define a novel matching score, based on these confidences, to measure text-image alignment; it guides feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward function gradients to generate semantically related images. Unlike previous feedback methods, which focus on overall matching, we place more emphasis on the accuracy of entity categories and quantities. In addition, we construct a text-to-image dataset for studying compositional generation, comprising 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. Our model can also serve as a metric for evaluating text-image alignment in other models. All code and the dataset are available at https://github.com/kingniu0329/Visions.
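
The final stage, fine-tuning on reward gradients while keeping a pretraining loss in the objective (as the TL;DR puts it), can be illustrated with a toy one-parameter example. This is a sketch under strong assumptions, not the paper's training loop: the quadratic `pretrain_loss` and `reward`, the weight `reward_weight`, and the learning rate are all hypothetical stand-ins, and the real method backpropagates through a diffusion model and an object detector rather than a scalar.

```python
def pretrain_loss(theta: float, anchor: float = 0.0) -> float:
    """Toy stand-in for the pretraining loss: pulls theta toward `anchor`."""
    return (theta - anchor) ** 2


def reward(theta: float, target: float = 1.0) -> float:
    """Toy stand-in for a differentiable reward: highest when theta == target."""
    return -((theta - target) ** 2)


def combined_loss(theta: float, reward_weight: float = 1.0) -> float:
    """Assumed objective: pretraining loss minus a weighted reward."""
    return pretrain_loss(theta) - reward_weight * reward(theta)


def combined_grad(theta: float, reward_weight: float = 1.0) -> float:
    """Analytic gradient of combined_loss for the default anchor and target."""
    grad_pretrain = 2.0 * (theta - 0.0)
    grad_reward = -2.0 * (theta - 1.0)
    return grad_pretrain - reward_weight * grad_reward


def fine_tune(theta: float = 0.0, steps: int = 200, lr: float = 0.1,
              reward_weight: float = 1.0) -> float:
    """Plain gradient descent on the combined objective."""
    for _ in range(steps):
        theta -= lr * combined_grad(theta, reward_weight)
    return theta


if __name__ == "__main__":
    # With equal weighting, the parameter settles between the pretraining
    # optimum (0.0) and the reward optimum (1.0).
    print(fine_tune())
```

The design point the toy makes is that the reward weight controls how far fine-tuning moves the model away from its pretrained behavior: a weight of zero leaves the parameter at the pretraining optimum, and larger weights move it toward the reward optimum.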

Paper Structure

This paper contains 15 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The steps in our fine-tuning method. (1) We create a text-to-image dataset containing different kinds of compositions. (2) We construct a reward from specific feedback, derived from the confidence in object category and quantity. (3) The diffusion model is fine-tuned by backpropagating reward function gradients to overcome text-image mismatch.
  • Figure 2: Comparison of text-image alignment on three metrics across three kinds of compositions: Normal, Awkward, and Unlikely.
  • Figure 3: Qualitative comparison with three SOTA methods in three kinds of compositions.
  • Figure 4: Comparison of images generated by the original SD v1.5 [38], ImageReward [15], DDPO [22], and our method under the “Fixed Category & Incremental Quantity” composition type. Images in the same row are generated with the same random seed. The prompts for the sample images generated from the first row to the fourth row are: “A sheep on the prairie”, “Two sheep on the prairie”, “Three sheep on the prairie”, and “Four sheep on the prairie”.
  • Figure 5: Comparison of images generated by the original SD v1.5 [38], ImageReward [15], DDPO [22], and our method under the “Fixed Quantity & Incremental Category” composition type. Images in the same column are generated with the same random seed. The prompts for the sample images generated from the first row to the fourth row are: “Cattle on the estate”, “Cattle and sheep on the estate”, “Cattle, sheep, and chicken on the estate”, and “Cattle, sheep, chicken, and geese on the estate”.
  • ...and 2 more figures