M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

Bangji Yang; Ruihan Guo; Jiajun Fan; Chaoran Cheng; Ge Liu

TL;DR

M3 introduces a training-free, multi-agent framework for inference-time self-refinement in high-fidelity text-to-image generation. A Planner decomposes complex prompts, a Checker and Refiner detect and correct misalignments, an Editor applies the edits, and a Verifier validates the outcomes, so that multi-round refinement improves monotonically. AutoRefiner provides a plug-and-play package usable with any pre-trained T2I model, while an M3-Hybrid variant demonstrates effectiveness with lightweight tools. On the GenEval and OneIG-EN benchmarks, M3 achieves state-of-the-art results (OneIG-EN overall score of $0.532$ for M3_VLM), substantially improving spatial reasoning and attribute adherence and even surpassing commercial flagship systems. This signals a practical, scalable path to superior compositional generation without retraining.

Abstract

Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.

Paper Structure (32 sections, 10 figures, 3 tables)

Introduction
Related Work
Text-to-Image Generative Models
Reasoning with Images
Self-Refinement in Generation
Method
The M3 Agentic Pipeline
Multi-Round Iterative and Robust Refinement
M3-Hybrid: Tool-Augmented Extension for Lightweight VLMs
AutoRefiner: A Plug-and-Play Package
Experiment
Benchmark Results on GenEval
Tool-Augmented M3: Enhancement for Specialized Models
Fully VLM-Powered M3: Maximum Fidelity for SOTA Models
Surpassing Commercial Flagship Models on OneIG-EN
...and 17 more sections

Figures (10)

Figure 1: The General Framework of M3 (Multi-Round, Multi-Agent, Multi-Modal): An Agentic Pipeline for Inference-Time Self-Refinement. (a) The Multi-Round Optimization Process (Macro-View): Illustrates the high-level, progressive enhancement workflow. An LLM-based Planner agent first analyzes the complex user prompt to generate a checklist of verifiable constraints. This checklist then orchestrates a multi-round series of iterative edits, where each round targets a specific, detected alignment failure (e.g., Iteration 0 corrects attribute binding, Iteration 1 corrects object count, Iteration 2 corrects artistic style), progressively resolving errors to produce the final refined generation. (b) The Multi-Agent Workflow within a Single Round (Micro-View): Details the inner mechanics of any single M3 iteration, revealing the multi-agent closed-loop feedback system. This pipeline consists of four collaborating agents that execute the plan from (a): 1) The Checker (VLM) evaluates the image against one constraint. 2) If "Failed," the Refiner (VLM) generates a targeted edit instruction. 3) The Editor (an off-the-shelf model) executes the edit to create a new candidate. 4) The Verifier (VLM) performs quality assurance, accepting the edit only if it measurably improves alignment over the previous generation, thus ensuring monotonic enhancement.
Figure 2: M3 is a plug-and-play, model-agnostic framework that is compatible with both large-scale and lightweight VLMs and tools.
Figure 3: M3 Systematically Corrects Diverse Failure Modes in State-of-the-Art Baseline Models. Side-by-side comparison of Qwen-Image-20B and M3-refined outputs across six failure categories.
Figure 4: Visualization of M3's Multi-Round Refinement Process. Iterative enhancement from failed baseline (Iteration 0) to final aligned result for three highly complex compositional prompts.
Figure 5: Comparative Analysis of M3 Design Variants. Visual comparison showing complementary strengths: M3 (VLM) excels at creative interpretation and complex text rendering (note dramatic visual effects), while M3 (Rule) maintains tighter compositional control with cleaner, structured layouts.
...and 5 more figures
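
The workflow in the Figure 1 caption can be sketched as a short control loop. This is a minimal illustration, not the authors' implementation and not the AutoRefiner API: the names `Agents`, `refine`, `max_rounds`, and every `toy_*` function are hypothetical. Real agents would be LLM/VLM calls and an image editor; here the "image" is a plain dictionary of attributes so the loop can run on its own. Choosing the first failed constraint in each round, and scoring by the number of satisfied constraints, are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Agents:
    """Hypothetical bundle of the five roles named in the Figure 1 caption."""
    planner: Callable[[str], List[str]]            # prompt -> checklist of constraints
    checker: Callable[[Any, str], bool]            # does the image satisfy one constraint?
    refiner: Callable[[Any, str], str]             # failed constraint -> edit instruction
    editor: Callable[[Any, str], Any]              # apply the instruction, return a candidate
    verifier: Callable[[Any, Any, List[str]], bool]  # is the candidate better than the current image?


def refine(prompt: str, image: Any, agents: Agents, max_rounds: int = 3) -> Any:
    """Multi-round refinement: one targeted edit per round, kept only if verified."""
    checklist = agents.planner(prompt)
    for _ in range(max_rounds):
        failed = [c for c in checklist if not agents.checker(image, c)]
        if not failed:
            break  # every constraint passes, nothing left to fix
        constraint = failed[0]  # assumption: handle failures in checklist order
        instruction = agents.refiner(image, constraint)
        candidate = agents.editor(image, instruction)
        # Verifier gate: a rejected candidate is discarded, so quality never drops.
        if agents.verifier(image, candidate, checklist):
            image = candidate
    return image


# --- Toy stand-ins so the sketch runs without any model (all hypothetical) ---
def toy_planner(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(",")]


def toy_checker(image: dict, constraint: str) -> bool:
    key, value = constraint.split("=")
    return image.get(key) == value


def toy_refiner(image: dict, constraint: str) -> str:
    return "set " + constraint


def toy_editor(image: dict, instruction: str) -> dict:
    key, value = instruction[len("set "):].split("=")
    return {**image, key: value}


def toy_verifier(current: dict, candidate: dict, checklist: List[str]) -> bool:
    def score(img: dict) -> int:
        return sum(toy_checker(img, c) for c in checklist)
    return score(candidate) > score(current)  # accept strict improvements only


TOY_AGENTS = Agents(toy_planner, toy_checker, toy_refiner, toy_editor, toy_verifier)

if __name__ == "__main__":
    start = {"color": "blue", "count": "3", "style": "photo"}
    print(refine("color=red, count=3, style=oil", start, TOY_AGENTS))
```

Because a candidate replaces the current image only when the verifier accepts it, a faulty edit leaves the image unchanged rather than degrading it, which is the monotonic behavior the caption describes.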
