Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model

Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, Chuanxin Tang, Zidong Wang, Yichen Wei, Liang Hu, Boyi Jiang, William Li, Ying He, Yang Liu, Xuchen Song, Eric Li, Yahui Zhou

TL;DR

The paper tackles efficiency gaps in unified multimodal models by proposing a lightweight 2B-parameter diffusion-based Kontext model (UniPic2-SD3.5M-Kontext) that is connected to an MLLM through a learnable connector, following MetaQuery. It introduces Progressive Dual-Task Reinforcement (PDTR) with GRPO, which strengthens image generation and editing in stages; the authors report that the stages reinforce each other rather than interfere, yielding state-of-the-art results with far fewer parameters than many baselines. Extending this, UniPic2-Metaquery integrates understanding, generation, and editing through the modular connector and generalizes well across multimodal tasks. The Skywork UniPic 2.0 framework shows practical potential for deployable, efficient multimodal intelligence, and the models and code are publicly released to support reproducibility.

Abstract

Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly more generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following MetaQuery, we connect UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to obtain a unified multimodal model, UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. These results validate the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.
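
The reinforcement stage described above relies on GRPO, whose core idea is to score each sample relative to the other samples drawn for the same prompt. The following is a minimal sketch of that group-relative normalization only, not the authors' implementation: the function name `group_relative_advantages`, the `eps` constant, and the reward values are illustrative assumptions, and the paper's actual reward models and policy update are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Illustrative helper (hypothetical name): each reward is centered on the
    group mean and scaled by the group standard deviation, so samples that
    beat their siblings get a positive advantage and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four images sampled from a single prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed within a group, no separate value network is needed to provide a baseline; a group whose samples all receive the same reward contributes zero advantage.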

Paper Structure

This paper contains 30 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Showcase of the UniPic 2.0 in image generation and editing.
  • Figure 2: The overall pipeline of UniPic 2.0.
  • Figure 3: Qualitative comparison of text-to-image generation results.
  • Figure 4: Qualitative comparison of image editing results.
  • Figure 5: Qualitative examples illustrating the capabilities of UniPic2-Metaquery across diverse multimodal tasks.
  • ...and 1 more figure