IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi

TL;DR

This work tackles prompt-image misalignment in diffusion-based image synthesis by introducing IMG, a re-generation-centric alignment framework. IMG combines an MLLM-driven misalignment detector with an Implicit Aligner that refines diffusion conditioning features to enable better re-generation, trained via an Iteratively Updated Preference Objective $L = L_{\text{base}} + \lambda L_{\text{pref}}$ that blends DPO and SPIN within a self-improving reference framework. Trained on Pick-a-Pic and integrated with IP-Adapter, IMG demonstrates strong, plug-and-play compatibility with base and finetuned models (e.g., SDXL, SDXL-DPO, FLUX), achieving substantial improvements over editing-based and finetuning-based baselines across diverse benchmarks. The approach significantly reduces the data and editing requirements typically needed for alignment, offering a scalable and practical path for reliable multimodal diffusion outputs in production-like settings.
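
The weighted sum in the objective above can be illustrated with a minimal sketch. The summary does not define $L_{\text{base}}$ or $L_{\text{pref}}$, so everything below is an assumption made for illustration: the preference term is written as a generic DPO-style logistic loss on scalar margins, and the names `dpo_style_preference_loss`, `combined_objective`, `beta`, and `lam` are hypothetical, not taken from the paper or its code.

```python
import math


def dpo_style_preference_loss(margin_policy, margin_reference, beta=1.0):
    """Generic DPO-style term: -log(sigmoid(beta * (policy - reference))).

    Each margin is assumed to be "score of preferred sample minus score of
    rejected sample" under the trained model or a frozen reference model.
    This is an illustrative stand-in, not the paper's exact L_pref.
    """
    z = beta * (margin_policy - margin_reference)
    # softplus(-z) equals -log(sigmoid(z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def combined_objective(base_loss, pref_loss, lam=0.5):
    """L = L_base + lambda * L_pref, with lam as a hypothetical weight."""
    return base_loss + lam * pref_loss


if __name__ == "__main__":
    pref = dpo_style_preference_loss(margin_policy=1.2, margin_reference=0.4)
    print(combined_objective(base_loss=0.3, pref_loss=pref, lam=0.5))
```

In this sketch the preference term shrinks as the trained model separates preferred from rejected samples more strongly than the reference does. In a SPIN-like setup the reference would be refreshed from the current model between rounds, which is presumably what "iteratively updated" refers to.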

Abstract

Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
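
The inference-time flow described in the abstract (analyze misalignment, align the conditioning features, re-generate) might be sketched as follows. This is only a control-flow illustration under stated assumptions: `RegenerationSteps`, `run_pipeline`, and every callable field are hypothetical placeholders, not the released IMG API, and the real components (diffusion model, MLLM, Implicit Aligner) would be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RegenerationSteps:
    """Hypothetical bundle of the components the abstract names."""
    generate: Callable[[str, int], Any]         # prompt, seed -> initial image
    analyze: Callable[[Any, str], Any]          # image, prompt -> MLLM guidance
    encode_image: Callable[[Any], Any]          # image -> conditioning features
    align: Callable[[Any, Any], Any]            # features, guidance -> aligned features
    regenerate: Callable[[str, Any, int], Any]  # prompt, features, seed -> image


def run_pipeline(steps: RegenerationSteps, prompt: str, seed: int = 0) -> Any:
    # Produce the initial, possibly misaligned image.
    initial = steps.generate(prompt, seed)
    # a) Ask the MLLM what is misaligned between image and prompt.
    guidance = steps.analyze(initial, prompt)
    # b) Translate the image features into better-aligned features.
    features = steps.encode_image(initial)
    aligned = steps.align(features, guidance)
    # Re-generate with the aligned features as an extra condition,
    # reusing the same seed as the initial generation.
    return steps.regenerate(prompt, aligned, seed)
```

Passing the components in as plain callables keeps the sketch independent of any particular model library; step c) of the abstract concerns training the aligner and is not part of this inference loop.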

Paper Structure

This paper contains 24 sections, 19 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: The multimodal misalignment issue. Even the latest state-of-the-art diffusion model, FLUX.1 [dev] (FLUX), may overlook or misinterpret concepts in prompts. With our proposed Implicit Multimodal Guidance (IMG) framework, the prompt-image misalignment issues are significantly mitigated in various aspects such as concept comprehension, aesthetic quality, object addition, and correction. In each case, both images are generated with the same random seed for fair comparison.
  • Figure 2: Comparison between our Implicit Multimodal Guidance (IMG) and existing editing-based alignment methods. a) Existing methods require additional editing operations that improve alignment in local regions but may compromise overall image quality. b) In contrast, IMG employs a re-generation-based alignment framework by manipulating diffusion conditioning features, ensuring pipeline simplicity and high-quality outputs.
  • Figure 3: Comparison with editing-based methods. We compare Instruct Pix2Pix and SLD with IMG. For Instruct Pix2Pix, the instructions are "add a woman" and "make the ball a rubber ball", generated by our finetuned MLLM.
  • Figure 4: Overview of the Implicit Multimodal Guidance (IMG) framework. Given an initial image that exhibits misalignments with its prompt, IMG begins by conducting an MLLM-driven misalignment analysis. Following this, IMG utilizes an Implicit Aligner to translate the initial image features into better-aligned features according to the MLLM's guidance. Finally, these aligned image features are incorporated as new conditions to re-generate images with improved prompt-image alignment.
  • Figure 5: Qualitative comparison with base models and finetuning-based alignment methods. The first two rows show that IMG addresses various misalignment types across different prompts, while the last row shows that IMG resolves misalignment issues that challenge both models.
  • ...and 10 more figures