M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning
Bangji Yang, Ruihan Guo, Jiajun Fan, Chaoran Cheng, Ge Liu
TL;DR
M3 is a training-free, multi-agent framework for inference-time self-refinement in high-fidelity text-to-image (T2I) generation. A Planner decomposes complex prompts, a Checker and Refiner detect and correct misalignments, an Editor applies the edits, and a Verifier validates each outcome, so that multi-round refinement improves results monotonically. AutoRefiner provides a plug-and-play package usable with any pre-trained T2I model, and an M3-Hybrid variant shows that the approach remains effective with lightweight tools. On the GenEval and OneIG-EN benchmarks, M3 achieves state-of-the-art results (an OneIG-EN overall score of 0.532 for M3_VLM), substantially improving spatial reasoning and attribute adherence and surpassing commercial flagship systems. This points to a practical, scalable path to stronger compositional generation without retraining.
Abstract
Generative models have achieved impressive fidelity in text-to-image synthesis, yet they struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct violated constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves strong results on the challenging OneIG-EN benchmark: our Qwen-Image+M3 reaches state-of-the-art performance (0.532 overall), surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
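The multi-agent loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Agents`, `m3_refine`, `toy_agents`) is hypothetical, the agents are toy stubs rather than foundation models, and an "image" is modeled simply as the set of constraints it satisfies. The acceptance rule (keep an edit only if the Verifier's score strictly increases) is an assumption about one way to obtain monotonic improvement.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

# Toy stand-in: an "image" is just the set of constraints it currently satisfies.
Image = FrozenSet[str]


@dataclass
class Agents:
    """Hypothetical agent interfaces; the real agents would wrap foundation models."""
    plan: Callable[[str], List[str]]             # Planner: prompt -> checklist
    check: Callable[[Image, str], bool]          # Checker: is this constraint met?
    refine: Callable[[Image, str], str]          # Refiner: propose an edit instruction
    edit: Callable[[Image, str], Image]          # Editor: apply the instruction
    verify: Callable[[Image, List[str]], float]  # Verifier: score against the checklist


def m3_refine(prompt: str, image: Image, agents: Agents,
              max_rounds: int = 3) -> Tuple[Image, float]:
    """Multi-round refinement; an edit is kept only if the verifier score rises."""
    checklist = agents.plan(prompt)
    best, best_score = image, agents.verify(image, checklist)
    for _ in range(max_rounds):
        failed = [c for c in checklist if not agents.check(best, c)]
        if not failed:
            break  # every checklist item is satisfied
        for constraint in failed:  # correct constraints one at a time
            instruction = agents.refine(best, constraint)
            candidate = agents.edit(best, instruction)
            score = agents.verify(candidate, checklist)
            if score > best_score:  # assumed acceptance rule: never accept a regression
                best, best_score = candidate, score
    return best, best_score


def toy_agents(destructive: bool = False) -> Agents:
    """Stub agents for illustration; a destructive editor erases earlier content."""
    def edit(image: Image, instruction: str) -> Image:
        target = instruction[len("add "):]
        return frozenset({target}) if destructive else image | {target}

    return Agents(
        plan=lambda prompt: prompt.split(" and "),
        check=lambda image, c: c in image,
        refine=lambda image, c: "add " + c,
        edit=edit,
        verify=lambda image, checklist: sum(c in image for c in checklist) / len(checklist),
    )
```

Calling `m3_refine("a red ball and a blue cube and a green cone", frozenset({"a red ball"}), toy_agents())` adds the two missing objects in the first round and stops in the second. With `toy_agents(destructive=True)`, every candidate edit fails to raise the score, so the starting image is returned unchanged.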
