InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Yunhong Lu, Qichao Wang, Hengyuan Cao, Xierui Wang, Xiaoyin Xu, Min Zhang

TL;DR

This work tackles the challenge of aligning text-to-image diffusion outputs with human preferences in a computationally efficient way. It introduces DDIM-InPO, which reframes diffusion as a single-step generative process and uses a reparameterization to assign implicit rewards to latent variables, coupled with an inversion-based method to select highly informative latent variables for fine-tuning. The method achieves state-of-the-art human-preference performance with only about 400 training steps and demonstrates substantial gains in both efficiency and generation quality over existing baselines. The approach is validated on SD1.5 and SDXL using the Pick-a-Pic v2 and HPDv2 datasets, and shows strong transfer to conditional generation tasks via ControlNet, underscoring its practical impact for rapid, preference-aligned diffusion-model deployment.

Abstract

Without using an explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to selectively fine-tune the outputs of specific latent variables. To accomplish this, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an inversion technique to estimate appropriate latent variables for preference optimization. This modification enables the diffusion model to fine-tune only the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference-aligning baselines for T2I diffusion models in human preference evaluation tasks.
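
To make the "implicit reward" idea concrete, here is a minimal sketch of the generic DPO pairwise loss that the abstract builds on. It is not the DDIM-InPO objective itself, which assigns rewards to latent variables of a diffusion model; the function names, the scalar log-probability inputs, and the default `beta` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_model: float, logp_ref: float, beta: float) -> float:
    """DPO's implicit reward: a scaled log-ratio between the fine-tuned
    model and a frozen reference model (no explicit reward model)."""
    return beta * (logp_model - logp_ref)


def dpo_pair_loss(logp_w: float, logp_w_ref: float,
                  logp_l: float, logp_l_ref: float,
                  beta: float = 0.1) -> float:
    """Loss for one preference pair: 'w' is the preferred sample and
    'l' the rejected one. beta=0.1 is an illustrative value only."""
    margin = (implicit_reward(logp_w, logp_w_ref, beta)
              - implicit_reward(logp_l, logp_l_ref, beta))
    return -log_sigmoid(margin)


# Hypothetical log-probabilities: the model already favors the winner.
loss_good = dpo_pair_loss(logp_w=-1.0, logp_w_ref=-2.0,
                          logp_l=-3.0, logp_l_ref=-2.0)
# Same numbers with the roles swapped: the model favors the loser.
loss_bad = dpo_pair_loss(logp_w=-3.0, logp_w_ref=-2.0,
                         logp_l=-1.0, logp_l_ref=-2.0)
```

The loss shrinks as the model raises the preferred sample's likelihood relative to the reference more than it raises the rejected sample's, which is the behavior a preference-alignment method of this family optimizes for.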

Paper Structure

This paper contains 40 sections, 31 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: We develop DDIM-InPO, an efficient method to align diffusion models with human preference. It suffices to directly optimize the outputs of a small set of variables using human feedback data. This figure illustrates the results after 400 fine-tuning steps of SDXL-base-1.0 using our method, showing that the generated images exhibit strong visual appeal and align well with human preferences.
  • Figure 2: Illustration of Inversion for Preference Optimization.
  • Figure 3: Top: in human evaluations, InPO-SDXL shows a marked improvement over both DPO-SDXL and Base-SDXL. Bottom: qualitative comparisons among baselines. InPO-SDXL achieves superior prompt alignment and produces images of higher quality.
  • Figure 4: Comparison of the trade-off between the quality of generated images and training efficiency following human preference optimization of SD1.5 on the HPDv2 test set. Circle sizes represent the volume of training data used. Our DDIM-InPO achieves superior performance, with a training speed that is 18.4 and 3.6 times faster than Diffusion-KTO [li2024aligning] and Diffusion-DPO [wallace2024diffusion], respectively, while producing images of higher quality.
  • Figure 5: Advantages of DDIM-InPO: Compared with the refinement technique and Diffusion-DPO, we find that images generated by DDIM-InPO exhibit enhanced light and spatial structure, better realism and detail capture, greater imaginative design, stable color consistency, optimized multi-instance layout, and better text integration in visuals. These are some hidden advantages aligned with human preferences. Prompts from left to right: (1) A small bird sitting in a metal wheel. (2) Trees seen through a car window on a rainy day. (3) Description, An artistic rendering of a cosmic portal with a beach at dusk on the other side. (4) A person in full samurai armor at the beach. (5) A man standing in front of a bunch of doughnuts. (6) A towel with the word 'cat' printed on it, simple and clear text.
  • ...and 12 more figures
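
As background for the inversion illustrated in Figure 2, the sketch below shows the standard deterministic DDIM update and its inversion on a toy vector. It is only an assumed illustration of generic DDIM inversion, not the paper's reparametrized procedure: `eps_model`, the `alphas` schedule (cumulative signal coefficients, decreasing from clean to noisy), and the toy values are all hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]
EpsModel = Callable[[Vector, int], Vector]


def ddim_move(x: Vector, eps: Vector,
              alpha_from: float, alpha_to: float) -> Vector:
    """Deterministic DDIM update (eta = 0) between two noise levels."""
    # Predict the clean sample implied by the current noise estimate.
    x0_pred = [(xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
               for xi, ei in zip(x, eps)]
    # Re-noise that prediction to the target noise level.
    return [math.sqrt(alpha_to) * x0 + math.sqrt(1.0 - alpha_to) * ei
            for x0, ei in zip(x0_pred, eps)]


def ddim_invert(x0: Vector, eps_model: EpsModel, alphas: List[float]) -> Vector:
    """Map a clean sample to a latent by running the update toward more noise."""
    x = list(x0)
    for t in range(len(alphas) - 1):
        x = ddim_move(x, eps_model(x, t), alphas[t], alphas[t + 1])
    return x


def ddim_sample(x_t: Vector, eps_model: EpsModel, alphas: List[float]) -> Vector:
    """Run the same update toward less noise, starting from the latent."""
    x = list(x_t)
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_move(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x


def toy_eps(x: Vector, t: int) -> Vector:
    """Stand-in noise predictor; a real one would be a trained network."""
    return [0.3, -0.2]


alphas = [0.999, 0.9, 0.5, 0.1]  # hypothetical schedule, clean -> noisy
image = [1.0, -0.5]
latent = ddim_invert(image, toy_eps, alphas)
reconstruction = ddim_sample(latent, toy_eps, alphas)
```

With this constant toy predictor the round trip reproduces the input up to floating-point error. With a learned predictor, inversion is only approximate, because the noise estimate is evaluated at slightly different points in the forward and backward passes.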