Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang

TL;DR

This work proposes a reinforcement learning-based post-training strategy to unlock multimodal interleaved generation capability in existing unified models, without relying on large-scale multimodal interleaved datasets.

Abstract

Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
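The group-relative part of GRPO can be illustrated with a minimal sketch: several completions are sampled for the same prompt, each receives a scalar reward, and rewards are standardized within the group to obtain advantages. The function names, the component keys (`text`, `align`, `structure`), and the equal weights below are assumptions made for illustration; the abstract names the three reward components but does not state how they are combined.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three reward components named in the abstract
# (textual relevance, visual-text alignment, structural fidelity).
WEIGHTS = {"text": 1.0, "align": 1.0, "structure": 1.0}


def hybrid_reward(scores, weights=WEIGHTS):
    """Combine per-completion component scores into one scalar (assumed: weighted sum)."""
    return sum(weights[k] * scores[k] for k in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a group of completions sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A completion that scores above its group mean receives a positive advantage and one below receives a negative advantage, so no separate learned value model is needed to form the baseline.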

Paper Structure (18 sections, 7 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Overview of our reinforcement fine-tuning framework. Multimodal tokens are autoregressively generated and decoded into completions, with token probabilities used to compute KL divergence along a single trajectory. Hybrid rewards are assigned to each completion, and token-level group relative advantages are calculated to guide policy optimization along with KL regularization.
  • Figure 2: Visualizations of multimodal interleaved generation. Qualitative examples illustrate the model’s capacity to produce coherent interleaved outputs, smoothly transitioning between text and image modalities within a unified generation process.
  • Figure 3: Illustration of data preparation process during the warm-up stage.
  • Figure 4: Failure case analysis of hallucinated visual content.
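The Figure 1 caption mentions that token probabilities are used to compute a KL divergence along a single trajectory. As a rough sketch, the per-token estimator commonly paired with GRPO is shown below; whether the authors use this exact estimator is an assumption, and `kl_per_token` is a hypothetical helper name. It takes the log-probabilities that the current policy and a frozen reference model assign to the same generated tokens, whether those tokens are text or image tokens.

```python
import math


def kl_per_token(logp_policy, logp_ref):
    """Per-token KL estimate exp(d) - d - 1 with d = logp_ref - logp_policy.

    The estimate is non-negative and equals zero when both models
    assign the same probability to the sampled token.
    """
    estimates = []
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp
        estimates.append(math.exp(d) - d - 1.0)
    return estimates
```

Averaging these values over a trajectory gives a penalty that grows as the policy drifts from the reference model, which is the role KL regularization plays in the caption.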