DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang

TL;DR

DeepGen 1.0 addresses the prohibitive training costs and deployment footprints of unified multimodal image generation and editing models with a lightweight 5B model. It introduces Stacked Channel Bridging (SCB), which fuses features from multiple VLM layers with learnable think tokens, and pairs it with a three-stage data-centric training regime: alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO. Trained on roughly 50M samples, DeepGen 1.0 matches or surpasses models up to 16× larger across generation and editing benchmarks, including reasoning-intensive ones; it surpasses the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. The authors open-source the training code, weights, and datasets.
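
The summary above does not spell out how SCB combines layers, so the following is only a minimal sketch of one plausible reading: hidden states from a few selected VLM layers are concatenated along the channel axis and linearly projected to the DiT width. The function names, the layer selection, the projection weights, and the treatment of think tokens as extra positions in the VLM sequence are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]  # shape: [tokens][channels]


def stack_channels(layer_states: List[Matrix]) -> Matrix:
    """Concatenate per-token features from several layers along the channel axis."""
    num_tokens = len(layer_states[0])
    if any(len(states) != num_tokens for states in layer_states):
        raise ValueError("all layers must cover the same token sequence")
    return [
        [value for states in layer_states for value in states[t]]
        for t in range(num_tokens)
    ]


def project(features: Matrix, weight: Matrix) -> Matrix:
    """Apply a linear map; weight has shape [in_channels][out_channels]."""
    out_channels = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) for j in range(out_channels)]
        for row in features
    ]


def bridge(hidden_states: List[Matrix], layer_ids: List[int], weight: Matrix) -> Matrix:
    """Pick selected VLM layers, stack them channel-wise, project to the DiT width."""
    selected = [hidden_states[i] for i in layer_ids]
    return project(stack_channels(selected), weight)


# Toy input: 3 VLM layers, 2 tokens (assumed: one prompt token, one think token), 2 channels.
hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0
    [[2.0, 0.0], [0.0, 2.0]],  # layer 1
    [[3.0, 0.0], [0.0, 3.0]],  # layer 2
]
layer_ids = [0, 2]  # hypothetical choice: one shallow and one deep layer
# Hypothetical projection from 4 stacked channels down to a DiT width of 2.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

conditions = bridge(hidden_states, layer_ids, weight)
print(conditions)
```

In this toy setup each output token simply sums the matching channels of the shallow and deep layers; a trained projection would instead learn how to weight low-level and high-level features.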

Abstract

Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
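
The abstract says MR-GRPO uses a mixture of reward functions but gives no formula. As a rough illustration, the sketch below combines several reward signals into one scalar and then applies the group-relative normalization used in standard GRPO, where each sample is scored against the other samples drawn for the same prompt. The reward names, the weights, and the choice to mix rewards before normalizing are hypothetical; the paper's actual scheme may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List


def mix_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of several reward signals for one generated image."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward names and weights, for illustration only.
weights = {"preference": 0.5, "text_rendering": 0.5}
group = [
    {"preference": 0.8, "text_rendering": 0.4},
    {"preference": 0.6, "text_rendering": 0.2},
    {"preference": 0.9, "text_rendering": 0.9},
    {"preference": 0.3, "text_rendering": 0.1},
]

rewards = [mix_rewards(sample, weights) for sample in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that beat the group average receive a positive advantage and are reinforced, while those below it are discouraged; no separate value network is needed for this baseline.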

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of DeepGen 1.0’s visual generation and editing abilities, including reasoning-intensive scenarios.
  • Figure 2: Model performance comparison on image generation and editing benchmarks. Bubble size is proportional to model parameter count. Dashed outer rings indicate models with unreported parameter counts. Higher scores correspond to better performance.
  • Figure 3: Overview of DeepGen 1.0 architecture. DeepGen 1.0 adopts a unified VLM-DiT paradigm with a dual-branch visual encoding strategy: a ViT encoder captures high-level semantics for the VLM, while a VAE encoder extracts compressed latent features for the DiT. Multimodal conditions derived from the VLM, together with reference-image VAE latents, are concatenated with the target image’s noise tokens to form a single DiT input sequence, enabling self-attention over both conditioning and generation signals. Stacked channel bridging (SCB) performs deep feature fusion between the VLM and DiT to strengthen generation and editing, while DiT positional encodings explicitly distinguish reference tokens from target tokens. Icons shown at the right of each block indicate whether the corresponding module is frozen or trainable during the Pre-Training, SFT, and RL stages, respectively.
  • Figure 4: Overview of our training data for broad omni-capabilities and comprehensive evaluation across benchmarks.
  • Figure 5: UniGenBench evaluation curves during RL training over 1,500 steps. The left axis shows the overall score and the right axis shows the text generation sub-score. Both metrics improve steadily throughout training, with the overall score rising from $\sim$0.747 to $\sim$0.756 and the text score increasing from $\sim$0.25 to $\sim$0.34, demonstrating that RL simultaneously enhances text rendering fidelity and general generation quality.
  • ...and 1 more figure
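
The Figure 3 caption describes a single DiT input sequence built from VLM-derived conditions, reference-image VAE latents, and the target image's noise tokens, with positional encodings that tell reference tokens apart from target tokens. The sketch below illustrates that layout with placeholder tokens. The segment order, the `role_id` tag, and all names are assumptions for illustration; the caption does not specify how the positional encodings are actually constructed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    source: str   # "condition", "reference", or "target"
    role_id: int  # hypothetical positional tag separating the segments
    index: int    # position within its own segment


def build_dit_sequence(num_condition: int, num_reference: int, num_target: int) -> List[Token]:
    """Lay out one joint sequence so self-attention spans conditions and generation."""
    segments = [
        ("condition", 0, num_condition),  # multimodal conditions from the VLM
        ("reference", 1, num_reference),  # VAE latents of the reference image
        ("target", 2, num_target),        # noise tokens of the image being generated
    ]
    sequence: List[Token] = []
    for source, role_id, count in segments:
        sequence.extend(Token(source, role_id, i) for i in range(count))
    return sequence


# Editing: a reference image is present.
edit_sequence = build_dit_sequence(num_condition=3, num_reference=4, num_target=4)
# Text-to-image generation: no reference image, so that segment is empty.
gen_sequence = build_dit_sequence(num_condition=3, num_reference=0, num_target=4)

print(len(edit_sequence), len(gen_sequence))
```

Because reference and target tokens share the same within-segment indices here, the extra `role_id` is what keeps a reference patch distinguishable from the target patch at the same spatial location.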