Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

Xinyan Chen, Jiaxin Ge, Tianjun Zhang, Jiaming Liu, Shanghang Zhang

TL;DR

This work addresses the difficulty diffusion-based text-to-image models have in following prompts that specify spatial relationships. It introduces Iterative Prompt Relabeling (IPR), a four-stage workflow that combines diffusion sampling, detector-based feedback, prompt relabeling, and iterative training so that each training image is paired with a prompt that describes it accurately. Using GLIPv2 for feedback and a simple reward-based loss rescaling, IPR improves spatial accuracy by up to 15.22% (absolute) on the VISOR benchmark and keeps CLIP alignment competitive across SDv2 and SDXL with LoRA, outperforming prior RL-based baselines. The approach is plug-and-play and data-efficient, and it generalizes across spatial relation types, though it trades some global image fidelity for spatial precision.

Abstract

Diffusion models have shown impressive performance in many domains. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. In this work, we propose Iterative Prompt Relabeling (IPR), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback. IPR first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations. With IPR, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods. Our code is publicly available at https://github.com/xinyan-cxy/IPR-RLDF.

Paper Structure (47 sections, 12 figures, 13 tables, 1 algorithm)

Figures (12)

  • Figure 1: A high-level overview of our approach. We enhance the alignment of images with text through an iterative process of image sampling and prompt relabeling.
  • Figure 2: The general pipeline of IPR. Our approach adopts four different stages: (1) diffusion model sampling, (2) reward-based loss rescaling, (3) prompt relabeling, and (4) iterative training.
  • Figure 3: Visual comparison of the original SDXL model with fine-tuned versions using RLDF, PR-RLDF, and IPR-RLDF, across four different prompts. Our algorithm demonstrates superior spatial awareness and accuracy in object depiction, while sacrificing some details.
  • Figure 4: The process of our IPR algorithm. (1) Sampling Images from Diffusion Models: sample images from a diffusion model conditioned on textual prompts. (2) Prompt Relabeling: run a detector on the generated image to obtain bounding boxes; analyze the boxes to rewrite the original prompt so that it matches the image. (3) Detection-Based Loss Re-scaling: use the detection model's output to rescale the loss function. (4) Iterative Training: retrain the model iteratively on the updated dataset.
  • Figure 5: Samples from different models fine-tuned with IPR-RLDF, generated from two distinct prompts. (1) Left column: unfrozen fine-tuning on SDv2. (2) Middle column: using LoRA to fine-tune SDv2. (3) Right column: using LoRA to fine-tune SDXL. LoRA training yields noticeably higher image fidelity than unfrozen training.
  • ...and 7 more figures
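
The prompt-relabeling stage described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `Box` class, the `observed_relation` and `relabel` helpers, the center-based comparison rule, and the fixed prompt template "a X <relation> a Y" are all hypothetical, chosen only to show how a detector's bounding boxes could turn a mismatched prompt into one that describes the generated image. The boxes are assumed to come from a detector such as GLIPv2, in pixel coordinates with the origin at the top-left corner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned detection box (hypothetical format), origin at top-left."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def observed_relation(a: Box, b: Box) -> str:
    """Relation of `a` with respect to `b`, decided by comparing box centers.

    Assumption: the dominant axis of displacement determines the relation.
    """
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "to the left of" if dx < 0 else "to the right of"
    # Image y grows downward, so a smaller y means higher in the image.
    return "above" if dy < 0 else "below"


def relabel(prompt: str, a: Box, b: Box):
    """Return (prompt describing the image, whether the original prompt matched)."""
    observed = f"a {a.label} {observed_relation(a, b)} a {b.label}"
    return observed, observed == prompt


# Example: the model was asked for a dog on the left but drew it on the right.
dog = Box("dog", 300, 100, 400, 200)
cat = Box("cat", 50, 110, 150, 210)
new_prompt, matched = relabel("a dog to the left of a cat", dog, cat)
print(new_prompt, matched)
```

In this sketch a mismatched image is not discarded: it is kept as a training example under the relabeled prompt, which is the "learning from mistakes" idea in the title.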