Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
TL;DR
The paper addresses a persistent problem in multimodal models: improving generative capability often harms understanding. It proposes the Reason-Reflect-Refine ($R3$) framework, which recasts generation as a Reason $\rightarrow$ Reflect $\rightarrow$ Refine loop that actively uses the model's internal understanding. Built on a unified BAGEL backbone, $R3$ decomposes generation into (i) Reason, to plan and draft; (ii) Reflect, to critically assess alignment with the prompt; and (iii) Refine, to iteratively edit. Training is guided by outcome-based rewards and an efficient Tree-RL regime using GRPO and FlowGRPO. Stage-wise rewards tie diffusion and text generation to a vision-language model score $V$, and a correctness metric $C_j$ ensures meaningful improvements and proper termination when $\hat{V}=1$. Empirical results on GenEval++, TIIF, VQA, and ITA show that $R3$ yields stronger generation while preserving or enhancing understanding, and that the approach scales with adaptive inference. The framework and training strategy thus offer a principled, practical path toward unified multimodal models with improved coherence and alignment to user prompts.
Abstract
Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify a likely primary cause: a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This algorithm reframes the single-step generation task as a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improved understanding abilities related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
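The "generate-understand-regenerate" process can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the function `r3_loop`, the `R3Trace` record, the `max_rounds` parameter, and the three toy callables are all hypothetical names introduced here, and the toy "model" operates on strings rather than images. The sketch only assumes what the summary states, namely that a draft is produced, assessed against the prompt, and edited until the assessment returns 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class R3Trace:
    """Record of one run of the loop (hypothetical structure)."""
    outputs: List[str]  # every draft produced, in order
    scores: List[int]   # verdict of each Reflect step (1 = aligned)


def r3_loop(
    prompt: str,
    reason: Callable[[str], str],
    reflect: Callable[[str, str], Tuple[int, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> R3Trace:
    """Reason once, then alternate Reflect and Refine until aligned.

    Assumption: `reflect` returns (verdict, critique) with verdict == 1
    meaning the draft already matches the prompt. If `max_rounds` is
    exhausted, the last refined draft is returned without a final check.
    """
    draft = reason(prompt)  # Reason: plan and produce a first draft
    trace = R3Trace(outputs=[draft], scores=[])
    for _ in range(max_rounds):
        verdict, critique = reflect(prompt, draft)  # Reflect: assess draft
        trace.scores.append(verdict)
        if verdict == 1:  # stop as soon as the draft is judged aligned
            break
        draft = refine(prompt, draft, critique)  # Refine: edit the draft
        trace.outputs.append(draft)
    return trace


# Toy stand-ins: a "draft" is a string that should contain every prompt word.
def toy_reason(prompt: str) -> str:
    return prompt.split()[0]


def toy_reflect(prompt: str, draft: str) -> Tuple[int, str]:
    missing = [w for w in prompt.split() if w not in draft.split()]
    return (0, missing[0]) if missing else (1, "")


def toy_refine(prompt: str, draft: str, critique: str) -> str:
    return f"{draft} {critique}"


trace = r3_loop("red cube left", toy_reason, toy_reflect, toy_refine, max_rounds=5)
print(trace.outputs[-1], trace.scores)
```

In this toy run the loop stops on its own once the Reflect step reports alignment, so the number of refinement rounds adapts to how far the first draft is from the prompt.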
