Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu

TL;DR

PPAD tackles semantic drift in diffusion-based text-to-image generation by integrating a Multimodal LLM as an in-process observer and corrector. It introduces three components (Lookahead Sketch Generator, Semantic Corrected Knowledge, and Ping-Pong-Ahead) that provide real-time, interpretable feedback to steer the denoising trajectory with minimal overhead, and it supports both inference-only and training-enhanced setups. The authors provide theoretical guarantees on error propagation under a minimum SNR and report significant improvements on DrawBench and Pick-a-Pic across multiple backbones and metrics. The work shows a practical way to couple language-grounded semantic supervision tightly with diffusion models, enabling more faithful, controllable, and transparent image synthesis.

Abstract

Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, which makes them ineffective at providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings and performs semantic correction at only a handful of diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.
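The control flow described in the abstract (inspect an intermediate result at a few steps, then feed a correction signal into the remaining steps) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the state is a single float rather than an image latent, and `guided_denoise`, `toy_step`, and `toy_observer` are hypothetical names standing in for the diffusion sampler and the MLLM observer.

```python
from typing import Callable, List, Optional, Tuple


def guided_denoise(
    x: float,
    num_steps: int,
    check_steps: List[int],
    denoise_step: Callable[[float, int, float], float],
    observe: Callable[[float, int], Optional[float]],
    guidance: float,
) -> Tuple[float, int]:
    """Run a denoising loop, consulting the observer only at check_steps."""
    observer_calls = 0
    for t in range(num_steps, 0, -1):
        if t in check_steps:
            # Hypothetical stand-in for an MLLM inspecting an intermediate result.
            feedback = observe(x, t)
            observer_calls += 1
            if feedback is not None:
                # The correction signal conditions all remaining steps.
                guidance = feedback
        x = denoise_step(x, t, guidance)
    return x, observer_calls


def toy_step(x: float, t: int, guidance: float) -> float:
    # Toy "denoiser": move halfway toward the current guidance target.
    return x + 0.5 * (guidance - x)


def toy_observer(x: float, t: int) -> Optional[float]:
    # Toy "observer": assumes the intended target is 1.0 and reports it
    # whenever the intermediate state is noticeably off.
    return 1.0 if abs(x - 1.0) > 1e-3 else None


# Start with a wrong guidance signal (-1.0) and correct it at two steps only.
corrected, calls = guided_denoise(0.0, 10, [8, 4], toy_step, toy_observer, -1.0)
uncorrected, _ = guided_denoise(0.0, 10, [], toy_step, toy_observer, -1.0)
print(corrected, uncorrected, calls)
```

In this toy setting the observer is called twice out of ten steps, which mirrors the abstract's point that correction is applied at only a few steps while still redirecting the rest of the trajectory.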

Paper Structure

This paper contains 26 sections, 2 theorems, 25 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Consider the reverse denoising process of DDIM under the following conditions. 1) The noise prediction model $\mathcal{M}_\theta^{\mathrm{DM}}(\mathbf{x}_t,t,\mathbf{c})$ exhibits bounded prediction error: $\forall t$, $\exists\delta>0$ such that $\|\mathcal{M}_\theta^{\mathrm{DM}}(\mathbf{x}_t,t,\mathbf{c})\ldots$ […] and $C$ is a constant.
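To make the setting of the theorem concrete, the following sketch shows how an error in the predicted noise enters a single deterministic DDIM step. It assumes the standard DDIM update with $\eta = 0$ and a scalar state, and it is not the paper's bound, whose statement is truncated above. The names `ddim_step`, `eps_error_gain`, `abar_t`, and `abar_prev` (the cumulative noise-schedule products at steps $t$ and $t-1$) are illustrative.

```python
import math


def ddim_step(x_t: float, eps_pred: float, abar_t: float, abar_prev: float) -> float:
    """One deterministic DDIM update (eta = 0) for a scalar state."""
    # Estimate the clean sample from the current state and the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Re-noise the estimate to the previous (less noisy) step.
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred


def eps_error_gain(abar_t: float, abar_prev: float) -> float:
    """Factor by which an error in eps_pred is scaled within one step.

    The update is linear in eps_pred, so this is the absolute value of
    its coefficient on eps_pred.
    """
    return abs(
        math.sqrt(1.0 - abar_prev)
        - math.sqrt(abar_prev * (1.0 - abar_t) / abar_t)
    )


# Perturb the predicted noise by delta and compare the resulting states.
abar_t, abar_prev, delta = 0.5, 0.8, 0.05
clean = ddim_step(0.3, 0.2, abar_t, abar_prev)
perturbed = ddim_step(0.3, 0.2 + delta, abar_t, abar_prev)
print(abs(perturbed - clean), eps_error_gain(abar_t, abar_prev) * delta)
```

Because the step is linear in the predicted noise, a prediction error of size $\delta$ shifts the next state by exactly the gain times $\delta$ in this scalar case; how such per-step shifts accumulate over the trajectory is what an error-propagation result has to control.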

Figures (8)

  • Figure 1: Background and brief comparison of the baselines and our PPAD. (a) compares the workflows of four diffusion methods, while (b) summarizes their characteristics and presents generated images under the same prompt, demonstrating that our method effectively corrects semantic errors and guides generation in the right direction.
  • Figure 2: Overview of the proposed method.
  • Figure 3: Comparison of generated images.
  • Figure 4: Case of denoising path.
  • Figure 5: Performance under different rounds of MLLM invocations in PPAD.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 1: See the proof in Appendix \ref{proof:error_control}
  • Theorem 2: See the proof in Appendix \ref{proof:denoising}