Di3PO -- Diptych Diffusion DPO for Targeted Improvements in Image Generation

Sanjana Reddy, Ishaan Malhi, Sally Ma, Praneet Dutta

TL;DR

Di3PO (Diptych Diffusion DPO) is a targeted method for constructing positive and negative image pairs that share almost all background content and differ only in a localized region of interest, which makes preference tuning of diffusion models more efficient. By fixing the background and concentrating gradient updates on the region to be improved, the approach mitigates credit-assignment issues and converges faster on tasks such as text rendering. The authors implement a two-stage pipeline of data generation and rigorous offline filtering that produces high-quality diptych pairs without requiring reward models, and they report significant text rendering improvements on SDXL 1.0 over SFT and standard DPO baselines, measured with OCR-based metrics. The work highlights the sample efficiency and robustness of targeted diptych pairs and suggests that the method may apply to other localized generation challenges in diffusion models.
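To make the idea of an OCR-based text rendering metric concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes the OCR output is already available as a string, and it scores it against the target text with a normalized edit distance. The function names `edit_distance` and `text_accuracy`, and the choice of normalization, are hypothetical; the paper may use a different metric.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def text_accuracy(ocr_text: str, target_text: str) -> float:
    """Hypothetical score in [0, 1]: one minus the normalized edit distance.

    Assumption: comparison is case-insensitive and ignores surrounding whitespace.
    """
    ocr = ocr_text.strip().lower()
    target = target_text.strip().lower()
    longest = max(len(ocr), len(target))
    if longest == 0:
        return 1.0  # both strings empty: nothing to get wrong
    return 1.0 - edit_distance(ocr, target) / longest
```

A score of 1.0 means the OCR reading matches the requested text exactly; each wrong, missing, or extra character lowers the score in proportion to the longer string's length.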

Abstract

Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that either lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce "Di3PO", a novel method for constructing positive and negative pairs that isolates specific regions targeted for improvement during preference tuning, while keeping the surrounding context in the image stable. We demonstrate the efficacy of our approach by applying it to the challenging task of text rendering in diffusion models, showcasing improvements over baseline methods of SFT and DPO.
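The pair property described above (stable surrounding context, a change confined to the targeted region) can be illustrated with a small check. This is a sketch under stated assumptions, not the authors' filtering pipeline: it assumes a diptych is a single image whose left and right halves form the two panels, that a boolean mask of the targeted region is available, and that mean absolute pixel difference is an adequate similarity measure. The names `split_diptych` and `is_targeted_pair` and the thresholds `bg_tol` and `region_min` are hypothetical.

```python
from typing import Tuple

import numpy as np


def split_diptych(image: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Split an H x W x C diptych into left and right panels of equal width."""
    half = image.shape[1] // 2
    return image[:, :half], image[:, half:2 * half]


def is_targeted_pair(left: np.ndarray, right: np.ndarray, region_mask: np.ndarray,
                     bg_tol: float = 0.02, region_min: float = 0.05) -> bool:
    """Accept a pair whose panels match outside the mask and differ inside it.

    left, right: uint8 panels of identical shape (H x W x C).
    region_mask: boolean H x W array, True inside the targeted region.
    bg_tol, region_min: illustrative thresholds on mean absolute difference in [0, 1].
    """
    if left.shape != right.shape:
        return False
    if not region_mask.any() or region_mask.all():
        return False  # need both a targeted region and some background
    diff = np.abs(left.astype(np.float64) - right.astype(np.float64)) / 255.0
    per_pixel = diff.mean(axis=-1)  # average over colour channels
    background_diff = per_pixel[~region_mask].mean()
    region_diff = per_pixel[region_mask].mean()
    return bool(background_diff <= bg_tol and region_diff >= region_min)
```

Pairs that fail the background check are the ones the abstract describes as having variance in irrelevant pixel regions; pairs that fail the region check lack a meaningful difference.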

Paper Structure (27 sections, 5 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: We develop Di3PO - Diptych Diffusion DPO for Targeted Improvements in Image Generation - a method to create DPO pairs for image preference tuning with minimal background changes. After fine-tuning SDXL-1.0 on a Diptych set targeted at text rendering, our model demonstrates improved text rendering accuracy.
  • Figure 2: Diptych image generated using a single prompt. Generating Diptych images keeps the background consistent, allowing the model to focus on the text rendering.
  • Figure 3: Winning and losing images generated using two separate prompts have different backgrounds.
  • Figure 4: Preference pair generation workflow for creating Diptych pairs for DPO tuning.
  • Figure 5: Image generation from various checkpoints during the tuning process
  • ...and 7 more figures