UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Weijia Mao, Zhenheng Yang, Mike Zheng Shou

TL;DR

The paper tackles the challenge of post-training unified multimodal models without external image data by introducing UniRL, a self-improving framework that generates images from prompts and uses the resulting data to jointly train generation and understanding via SFT and GRPO. It constructs GenEval-inspired prompts to cover core visual features, and employs a differentiable sampling strategy to propagate learning signals through both modalities. A new imbalance metric quantifies the alignment between generation and understanding, and experiments on Show-o and Janus show that GRPO-based optimization yields robust improvements and a tighter balance between the two tasks than SFT and the baseline models. The work provides practical guidance for end-to-end versus non-end-to-end training regimes and demonstrates the feasibility of self-sufficient post-training with no external image data.

Abstract

Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only a few additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.

Paper Structure

This paper contains 26 sections, 12 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The imbalance between Text-to-Image Generation (T2I) and Multi-modal Understanding (MMU). For the same image, unified multimodal models may struggle to perform both generation and understanding consistently.
  • Figure 2: Training pipeline of UniRL. (a) SFT Optimization: A prompt is used to generate a single image, which, together with the corresponding question, is used to predict an answer and compute the SFT loss. (b) GRPO Optimization: The same prompt generates a set of images, each paired with the same question to produce multiple answers, which are used to compute the GRPO loss.
  • Figure 3: Qualitative comparison of our method with the original models on both text-to-image generation (T2I) and multimodal understanding (MMU) tasks.
  • Figure 4: Visualization results of our method.
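
The GRPO step in the Figure 2 caption (one prompt, a group of generated images, one answer per image) relies on scoring each sample relative to the rest of its group. The sketch below illustrates only that group-relative normalization, which is the standard GRPO advantage. The reward function is a hypothetical stand-in (exact-match between the predicted and expected answer), and the names `answer_match_rewards` and `group_relative_advantages` are illustrative; none of this is taken from the UniRL code.

```python
from statistics import mean, pstdev


def answer_match_rewards(answers, expected):
    """Hypothetical reward (an assumption, not the paper's definition):
    1.0 if the understanding answer matches the expected answer, else 0.0."""
    target = expected.strip().lower()
    return [1.0 if a.strip().lower() == target else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Population standard deviation is used here; implementations differ
    on this choice. eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt -> a group of four images -> four answers to the same question.
answers = ["two", "three", "Two", "one"]
rewards = answer_match_rewards(answers, expected="two")
advantages = group_relative_advantages(rewards)
```

Samples whose answers match receive a positive advantage and the others a negative one; when every sample in the group earns the same reward, all advantages are zero and the group contributes no learning signal.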