Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang

TL;DR

Janus-Pro-R1 tackles the disconnect between visual understanding and generation in multimodal LLMs by introducing a two-stage training regime: supervised fine-tuning to cultivate a genuine chain-of-thought for visual generation, followed by reinforcement learning to balance exploration and exploitation. This introspective approach enables iterative image generation and extends to unified image generation, including image editing, while also enhancing image semantic evaluation. Empirical results show state-of-the-art or competitive performance on text-to-image generation and editing benchmarks, and the model demonstrates strong capabilities as an image semantic evaluator that can guide RL for other models. The work highlights the potential of reinforcement learning to unlock true collaboration between perception and production in multimodal systems, with practical implications for high-quality image synthesis and robust evaluation.

Abstract

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning equips the MLLM with the foundational ability to generate genuine chain-of-thought (CoT) for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.

Paper Structure

This paper contains 49 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Enabling visual comprehension and generation to collaborate within MLLMs brings three revolutionary benefits for image generation.
  • Figure 2: We include two training stages to unlock Aha moments with CoT in visual generation: supervised fine-tuning and reinforcement learning.
  • Figure 3: Qualitative examples of introspective text-to-image generation that triggers Aha moments.
  • Figure 4: Our model achieves a stable trade-off between fidelity and editability in multi-turn editing.
  • Figure 5: (a) RL performance with different reward models. (b) Performance with different selection thresholds in SFT.
  • ...and 3 more figures