Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang
TL;DR
Janus-Pro-R1 tackles the disconnect between visual understanding and generation in multimodal LLMs by introducing a two-stage training regime: supervised fine-tuning to cultivate a genuine chain-of-thought for visual generation, followed by reinforcement learning to balance exploration and exploitation. This introspective approach turns image generation into an iterative process and extends it to unified image generation, including image editing, while also improving image semantic evaluation. Empirical results show state-of-the-art or competitive performance on text-to-image generation and editing benchmarks, and the model also serves as a strong image semantic evaluator that can guide RL for other models. The work highlights the potential of reinforcement learning to make visual comprehension and generation genuinely collaborate in multimodal systems, with practical implications for high-quality image synthesis and robust evaluation.
Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning equips the MLLM with the foundational ability to generate genuine chain-of-thought (CoT) for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
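The "iterative introspective process" described above can be pictured as a generate, self-evaluate, regenerate loop. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the function `introspective_generate`, its `generate` and `evaluate` callables, the `max_rounds` parameter, and the toy stand-ins are all hypothetical names introduced for illustration, and images are represented as plain strings.

```python
from typing import Callable, List, Tuple


def introspective_generate(
    prompt: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate, self-check, and regenerate until the check passes or rounds run out.

    `generate` and `evaluate` are hypothetical stand-ins for the model's
    generation and comprehension abilities; they are assumptions, not real APIs.
    Returns the final image and the number of generation rounds used.
    """
    history: List[str] = []  # rejected attempts the generator may condition on
    image = generate(prompt, history)
    for round_idx in range(1, max_rounds + 1):
        if evaluate(prompt, image):  # does the image match the prompt semantically?
            return image, round_idx
        history.append(image)
        if round_idx < max_rounds:
            image = generate(prompt, history)
    return image, max_rounds  # budget exhausted: return the last attempt


# Toy stand-ins (assumptions): each attempt is a labelled string,
# and the evaluator only accepts the third attempt.
def toy_generate(prompt: str, history: List[str]) -> str:
    return f"image_v{len(history) + 1}"


def toy_evaluate(prompt: str, image: str) -> bool:
    return image.endswith("v3")


result = introspective_generate("a red cube on a blue sphere", toy_generate, toy_evaluate)
```

With the toy stand-ins, the loop rejects the first two attempts and accepts the third. In the setting the abstract describes, both roles would be played by the same MLLM, which is what lets comprehension feed back into generation.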
