LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

Shentong Mo, Sukmin Yun

Abstract

Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
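
For context, GRPO (introduced in DeepSeekMath) replaces a learned critic with a group-normalized advantage. For a prompt $q$ and $G$ sampled outputs $\{o_1, \dots, o_G\}$ with scalar rewards $\{r_1, \dots, r_G\}$, the standard formulation is

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}, \qquad \mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i,\; \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\right)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

with importance ratio $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$. The abstract does not spell out $\mathcal{J}_{\mathrm{LVRPO}}$ itself; presumably it instantiates this template with the multimodal rewards described in Figure 1 below.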

Figures (6)

  • Figure 1: Illustration of the proposed LVRPO framework for unified multimodal understanding and generation. The framework proceeds in three stages: (1) Group Sampling: For a given multimodal prompt $q$, the unified MoT backbone (initialized from BAGEL) samples a group of $G$ independent outputs $\{o_1, \dots, o_G\}$, which can include text reasoning, visual tokens, or image edits. (2) Behavioral Reward Estimation: Each output is evaluated by a multi-dimensional reward system. We use a frozen SigLIP 2 referee to provide dense semantic grounding signals ($r_{sem}$), alongside rule-based verifiers for instruction following ($r_{ins}$) and knowledge consistency ($r_{kn}$). (3) Group Relative Optimization: Instead of using a separate critic network, LVRPO employs Group Relative Policy Optimization to compute the advantage $\hat{A}_i$ for each sample by normalizing its reward against the mean and standard deviation of the group. This behavioral feedback $\nabla_\theta \mathcal{J}_{\mathrm{LVRPO}}$ is backpropagated through the shared attention layers to jointly optimize the reasoning and generative experts, enforcing cross-modal consistency without the need for auxiliary representation-level alignment losses. (A code sketch of stages (2) and (3) follows this figure list.)
  • Figure 2: Reward convergence and policy stability during the GRPO phase. Left: Evolution of individual reward components ($r_{sem}$, $r_{ins}$, $r_{kn}$) showing consistent upward trajectories and fast semantic convergence. Right: Policy stability analysis showing stable KL-divergence as the advantage signal increases.
  • Figure 3: Analysis of the Understanding-Generation trade-off. While optimizing only for generative rewards leads to a decline in reasoning (MMMU), the full LVRPO objective demonstrates positive interference, where behavioral alignment in generation improves discriminative visual perception (MMVP).
  • Figure 4: Ablation Study: Impact of Freezing the Reward Model. We compare the training dynamics of LVRPO with a Frozen SigLIP 2 referee (blue) versus a Trainable Reward Model (red). While the trainable reward model lets the total reward climb rapidly (Left), a visual inspection of the outputs at step 2000 (Right) reveals "reward hacking": the model generates high-frequency adversarial noise that exploits the drifting encoder. In contrast, the frozen baseline maintains semantic integrity, demonstrating that a stable metric anchor is essential for the convergence of the GRPO objective.
  • Figure 5: Scaling behavior of LVRPO across model sizes. We track the GenEval score improvement over training steps. The 1.3B model (green) saturates early, struggling to internalize complex spatial-logic rewards ($r_{ins}$) due to limited capacity. The 7B model (blue) shows steady, robust improvement. The 13B model (red) demonstrates accelerated convergence, reaching state-of-the-art performance (0.92) in under 2,000 steps, less than 40% of the training time required by the 7B model, validating the efficiency of GRPO on larger unified backbones.
  • ...and 1 more figure
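
Stages (2) and (3) of Figure 1 reduce to a short, critic-free computation. The sketch below is a minimal PyTorch illustration, not the authors' implementation: the reward weights and the referee/verifiers callables (and their interfaces) are assumptions introduced here; only the group normalization itself follows the caption.

    import torch

    # Hypothetical weights: the captions do not specify how r_sem, r_ins,
    # and r_kn are combined, so a weighted sum is assumed.
    W_SEM, W_INS, W_KN = 1.0, 0.5, 0.5

    @torch.no_grad()
    def score_group(prompt, outputs, referee, verifiers):
        # `referee` stands in for the frozen SigLIP 2 scorer (r_sem). Per
        # Figure 4, it must stay frozen, e.g. referee.eval() plus
        # requires_grad_(False), or the policy learns to hack the reward.
        # `verifiers` stands in for the rule-based r_ins / r_kn checks.
        rewards = []
        for o in outputs:
            r_sem = referee(prompt, o)                # dense semantic grounding
            r_ins = verifiers.instruction(prompt, o)  # instruction following
            r_kn = verifiers.knowledge(prompt, o)     # knowledge consistency
            rewards.append(W_SEM * r_sem + W_INS * r_ins + W_KN * r_kn)
        return torch.tensor(rewards)

    def group_relative_advantages(rewards, eps=1e-6):
        # The critic-free GRPO step from Figure 1: each sample's advantage
        # is its reward normalized by the group mean and standard deviation.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

Because advantages are normalized within each sampled group, no value network is needed, which is what allows the behavioral feedback to be backpropagated directly through the shared attention layers of the MoT backbone.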