Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jingjing Jiang, Chongjie Si, Jun Luo, Hanwang Zhang, Chao Ma

TL;DR

This work applies reinforcement learning to unified multimodal large language models (ULMs), which handle both understanding and generation, with the aim of reducing reliance on supervised data. It introduces CoRL, a two-stage reinforcement learning framework built on group relative policy optimization (GRPO). A suite of verifiable rewards, including bidirectional cycle consistency and text-image matching, guides a unified RL stage for joint optimization; a refined RL stage then provides task-specific enhancement, yielding ULM-R1. ULM-R1 achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks, highlighting cross-task synergy and data efficiency. Code is publicly available.

Abstract

This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
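
The abstract names group relative policy optimization without showing its central computation. As a rough illustration only, the sketch below normalizes the rewards of a group of responses sampled for the same prompt, which is the group-relative advantage used in GRPO-style training. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for this example and are not taken from the ULM-R1 code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per response in a single sampled group.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no learned value function is needed.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones; when all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.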

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Results of different RL paradigms. Janus-Pro-1B (Chen et al., 2025) serves as the baseline.
  • Figure 2: Overview of CoRL, a co-reinforcement learning framework to jointly improve the dual capabilities of ULMs. CoRL adopts a two-stage RL procedure, comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement.
  • Figure 3: Qualitative comparison of text-to-image generation between Janus-Pro and ULM-R1. The red box marks an exemplary failure case.
  • Figure 4: Qualitative comparison of multimodal understanding between Janus-Pro and ULM-R1. The red box marks an exemplary failure case.
  • Figure 5: Illustration of training examples used in unified reinforcement learning.