Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou

TL;DR

Skywork UniPic is a 1.5B-parameter unified autoregressive model that handles image understanding, generation, and editing in a single architecture, using decoupled visual encoders that feed a shared LLM. It pairs MAR-based generation with SigLIP2-based understanding, and is trained with a four-stage curriculum and reward-driven data quality pipelines to reach competitive results at high resolutions on commodity hardware. The key contributions are the decoupled encoding strategy, a progressive training schedule, and reward-based data curation, which together yield competitive, near state-of-the-art results on GenEval, DPG-Bench, and editing benchmarks while remaining efficient. The work shows that high-fidelity, versatile multimodal capabilities can be achieved without large-scale parameter growth, offering a practical path to deployable multimodal AI in resource-constrained settings.

Abstract

We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 × 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). These results rest on three components: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 × 256 to 1024 × 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
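
The decoupled encoding idea in the abstract can be illustrated with a minimal routing sketch. This is not the released implementation: the class and function names (`UnifiedModel`, `SharedDecoder`, `mar_encoder`, `siglip2_encoder`, `build_model`) are hypothetical, the encoders are toy arithmetic stand-ins rather than real networks, and the editing pathway is left out because its wiring is not described here. The sketch only shows the structural point that each task selects its own visual encoder while every task passes through one shared decoder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stand-in for a tensor of visual tokens (assumption: a flat list of floats).
Features = List[float]


def mar_encoder(visual_input: List[float]) -> Features:
    """Toy placeholder for the generation-side (MAR) encoder."""
    return [x * 0.5 for x in visual_input]


def siglip2_encoder(visual_input: List[float]) -> Features:
    """Toy placeholder for the understanding-side (SigLIP2) encoder."""
    return [x + 1.0 for x in visual_input]


@dataclass
class SharedDecoder:
    """Toy placeholder for the single LLM that both pathways feed."""
    calls: List[str] = field(default_factory=list)

    def __call__(self, task: str, prompt: str, visual: Features) -> dict:
        # Record which task reached the decoder, to show it is shared.
        self.calls.append(task)
        return {"task": task, "prompt": prompt, "num_visual_tokens": len(visual)}


@dataclass
class UnifiedModel:
    decoder: SharedDecoder
    encoders: Dict[str, Callable[[List[float]], Features]]

    def run(self, task: str, prompt: str, visual_input: List[float]) -> dict:
        if task not in self.encoders:
            raise ValueError(f"unsupported task: {task!r}")
        # Task-specific encoder, then the same decoder for every task.
        visual = self.encoders[task](visual_input)
        return self.decoder(task, prompt, visual)


def build_model() -> UnifiedModel:
    """Wire the two hypothetical encoders to one shared decoder."""
    return UnifiedModel(
        decoder=SharedDecoder(),
        encoders={"generate": mar_encoder, "understand": siglip2_encoder},
    )


if __name__ == "__main__":
    model = build_model()
    print(model.run("understand", "Describe the image.", [0.0, 1.0]))
    print(model.run("generate", "A red bicycle.", [0.0, 0.0, 0.0]))
```

Keeping the encoder choice in a per-task table makes the separation explicit: swapping or retraining one encoder leaves the other pathway and the shared decoder interface unchanged.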

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Showcases of our model's performance on editing and generation tasks.
  • Figure 2: The overall framework of Skywork UniPic. (a) Image generation is achieved through a masked autoregressive process using the MAR model (Li et al., 2024). (b) Image understanding is performed using a SigLIP2 encoder (Tschannen et al., 2025) to extract rich visual features, which are subsequently passed to an LLM for autoregressive text generation. Both pathways share a single LLM to promote consistent instruction-following and enable knowledge transfer between generation and understanding tasks.
  • Figure 3: Performance comparison across multiple benchmarks. Skywork UniPic demonstrates competitive performance across understanding, generation, editing, and in-context tasks while maintaining exceptional parameter efficiency with only 1.5B activated parameters.
  • Figure 4: Qualitative comparison of text-to-image generation results. Skywork UniPic produces high-quality images that accurately reflect textual prompts while maintaining competitive visual fidelity compared to both open-source and proprietary models.
  • Figure 5: Qualitative comparison of image editing results. Skywork UniPic successfully handles diverse editing instructions while preserving image quality and maintaining consistency in unmodified regions, demonstrating the effectiveness of our unified approach.
  • ...and 1 more figure