IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
TL;DR
This work tackles prompt-image misalignment in diffusion-based image synthesis by introducing IMG, a re-generation-centric alignment framework. IMG combines an MLLM-driven misalignment detector with an Implicit Aligner that refines diffusion conditioning features to enable better re-generation, trained via an Iteratively Updated Preference Objective $L = L_{\text{base}} + \lambda L_{\text{pref}}$ that blends DPO and SPIN within a self-improving reference framework. Trained on Pick-a-Pic and integrated with IP-Adapter, IMG demonstrates strong, plug-and-play compatibility with base and finetuned models (e.g., SDXL, SDXL-DPO, FLUX), achieving substantial improvements over editing-based and finetuning-based baselines across diverse benchmarks. The approach significantly reduces the data and editing requirements typically needed for alignment, offering a scalable and practical path for reliable multimodal diffusion outputs in production-like settings.
Abstract
Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal as a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
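As a rough illustration of the weighted objective named in the TL;DR, $L = L_{\text{base}} + \lambda L_{\text{pref}}$, the sketch below combines a base loss with a preference term. The exact form of $L_{\text{pref}}$ is not given in this summary, so the preference term here is an assumption: it uses the standard DPO logistic form on scalar log-likelihoods, not the paper's blend of DPO and SPIN. The function names, the `beta` temperature, and the scalar inputs are all hypothetical placeholders.

```python
import math


def dpo_style_preference_loss(logp_win, logp_lose,
                              ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO-form loss, -log(sigmoid(beta * margin)).

    Assumption: scalar log-likelihoods stand in for whatever quantities
    the actual method scores; `beta` is a hypothetical temperature.
    """
    # How much more the model prefers the winner than the reference does.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def combined_objective(base_loss, pref_loss, lam):
    """Weighted sum L = L_base + lambda * L_pref from the TL;DR."""
    return base_loss + lam * pref_loss


if __name__ == "__main__":
    pref = dpo_style_preference_loss(-1.0, -2.0, -1.5, -1.5)
    print(combined_objective(base_loss=0.8, pref_loss=pref, lam=0.5))
```

With a zero margin the preference term equals $\log 2$, and it shrinks as the model favors the preferred sample more strongly than the reference does. In a self-improving setup such as the one the TL;DR describes, the reference values would presumably be refreshed between rounds, but that schedule is not specified here.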
