Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

Sen Ye; Mengde Xu; Shuyang Gu; Di He; Liwei Wang; Han Hu

TL;DR

The paper addresses a persistent issue in multimodal models: improving generative capability often harms understanding. It proposes the Reason-Reflect-Refine (R3) framework, which recasts generation as a $\text{Reason} \rightarrow \text{Reflect} \rightarrow \text{Refine}$ loop that actively uses the model's internal understanding. Built on a unified BAGEL backbone, R3 decomposes generation into (i) Reason, to plan and draft; (ii) Reflect, to critically assess alignment with the prompt; and (iii) Refine, to iteratively edit. Training is guided by outcome-based rewards and an efficient Tree-RL regime using GRPO and FlowGRPO. Stage-wise rewards tie diffusion and text generation to a vision-language model score $V$, with a correctness metric $C_j$ ensuring meaningful improvements and proper termination when $\hat{V}=1$. Empirical results on GenEval++, TIIF, VQA, and ITA show that R3 yields stronger generation while preserving or enhancing understanding, and that the approach scales with adaptive inference. The work thus offers a principled framework and training strategy for balancing generation and understanding through structured reasoning and iterative refinement, and a blueprint for future unified multimodal systems with improved coherence and alignment to user prompts.
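For readers unfamiliar with GRPO, its core idea is that each sampled output is scored relative to the other samples drawn for the same prompt, so no separate value network is needed. The following is a minimal sketch of that standard group normalization only; it is not the paper's Tree-RL or FlowGRPO procedure. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration (implementations differ on these details).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group it was sampled with.

    `rewards` holds one scalar per rollout for a single prompt.
    `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt, each scored 0 or 1 by a judge.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one; when every rollout earns the same reward, all advantages are zero and the group contributes no learning signal.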

Abstract

Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify its likely primary cause as a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This algorithm reframes the single-step generation task as a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improving the understanding abilities that are related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
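The "generate-understand-regenerate" process can be pictured as a small control loop. The sketch below shows only that control flow, under stated assumptions: `r3_generate`, the `Step` record, the `max_rounds` budget, and the three callables are hypothetical names, and the toy stand-ins operate on strings rather than images. In the actual framework a single unified model plays all three roles, and the authors' implementation is in the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    image: str     # stand-in for an image; a real system would hold pixels or latents
    verdict: int   # 1 if reflection judged the image aligned with the prompt, else 0
    feedback: str  # textual critique produced by the reflection step


def r3_generate(
    prompt: str,
    reason: Callable[[str], str],
    reflect: Callable[[str, str], Tuple[int, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,  # assumed stopping budget, not a value from the paper
) -> Tuple[str, List[Step]]:
    """Draft once, then alternate reflect and refine until aligned or out of budget."""
    image = reason(prompt)  # Reason: plan and produce an initial draft
    history: List[Step] = []
    for _ in range(max_rounds):
        verdict, feedback = reflect(prompt, image)  # Reflect: assess alignment
        history.append(Step(image, verdict, feedback))
        if verdict == 1:  # stop as soon as the draft is judged aligned
            break
        image = refine(prompt, image, feedback)  # Refine: edit using the critique
    return image, history


# Toy stand-ins: the "image" is a caption string and alignment is string equality.
def toy_reason(prompt: str) -> str:
    return "two apples"


def toy_reflect(prompt: str, image: str) -> Tuple[int, str]:
    return (1, "aligned") if image == prompt else (0, f"expected '{prompt}'")


def toy_refine(prompt: str, image: str, feedback: str) -> str:
    return prompt


final, history = r3_generate("three apples", toy_reason, toy_reflect, toy_refine)
```

In this toy run the first draft fails reflection, one refinement fixes it, and the second reflection ends the loop. Note that if the budget runs out, the last refined image is returned without a final reflection; a real system might choose to re-check it.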

Paper Structure (35 sections, 12 equations, 16 figures, 9 tables)

This paper contains 35 sections, 12 equations, 16 figures, 9 tables.

Introduction
Methodology
Unifying Generation and Understanding
Framework Overview
Tree-RL strategy
Stage-wise Reward
Experiments
Co-evolution of Understanding and Generation
Ablation Studies
Related Work
Conclusion
Appendix
Clarification on the Use of LLM
RL Training with GRPO
Group-Relative Policy Optimization (GRPO)
...and 20 more sections

Figures (16)

Figure 1: Fine-tuning BAGEL exclusively on generation or understanding degrades the complementary capability. Naive co-training shows minor gains, whereas our proposed method demonstrates significant improvement in both. Results are reported on the counting subset of GenEval++.
Figure 2: The inference pipeline of our Reason-Reflect-Refine framework. The model starts by Reasoning to produce an initial plan and image. It then enters an iterative Reflect-Refine loop, assessing its output and making corrections until the image aligns with the user's prompt or a stopping condition is met.
Figure 3: The training procedure, which alternates between optimizing the Reason policy and the Reflect-Refine policies. The replay buffer, populated by the Reason stage, provides on-policy data for training the subsequent stages.
Figure 4: Training reward curves of the Tree-RL versus Full Trajectory RL strategies. The reward curve for Full Trajectory RL is substantially lower than that of Tree-RL. This performance gap is attributed to the high variance and noise introduced by the long trajectories inherent in the full trajectory approach, which complicates the advantage assignment problem.
Figure 5: Qualitative comparison between BAGEL and our results.
...and 11 more figures
