Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou

TL;DR

Z-Image introduces a budget-friendly 6B-parameter diffusion transformer built on a Scalable Single-Stream DiT (S3-DiT) that achieves competitive image generation with far less compute. The approach hinges on an end-to-end pipeline spanning Efficient Data Infrastructure, Omni-pretraining, and Prompt-Enhancer–assisted SFT, complemented by few-step distillation (Decoupled DMD and DMDR) and RLHF to deliver high-quality, multilingual outputs. Code and weights are publicly released alongside extensive evaluations, including Elo-based human judgments and broad benchmarks, in which Z-Image-Turbo ranks at the top among open-source models while delivering sub-second inference on enterprise hardware. The work demonstrates that carefully engineered data, architecture, and post-training strategies can rival state-of-the-art closed models while dramatically reducing computational costs, enabling broader access and practical deployment in constrained environments.

Abstract

The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
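
As a quick sanity check on the training budget quoted in the abstract, the short sketch below derives the hourly GPU price implied by the two stated figures (314K H800 GPU hours, approx. $630K). The function names and the 512-GPU cluster size are hypothetical, chosen only for illustration; the abstract does not state the cluster size or the actual billing rate.

```python
# Figures taken from the abstract.
GPU_HOURS = 314_000        # 314K H800 GPU hours
TOTAL_COST_USD = 630_000   # approx. $630K

def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Price per GPU hour implied by a total cost and a GPU-hour budget."""
    return total_cost_usd / gpu_hours

def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock days if the budget were spread evenly over num_gpus GPUs."""
    return gpu_hours / num_gpus / 24

rate = implied_hourly_rate(TOTAL_COST_USD, GPU_HOURS)
# 512 GPUs is an assumed cluster size, not a number reported in the source.
days = wall_clock_days(GPU_HOURS, num_gpus=512)

print(f"implied rate: ${rate:.2f} per GPU hour")
print(f"wall-clock time on 512 GPUs (assumed): {days:.1f} days")
```

The implied rate works out to roughly $2 per H800 GPU hour, which is how the two numbers in the abstract relate to each other.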

Paper Structure

This paper contains 61 sections, 1 equation, 32 figures, 15 tables.

Figures (32)

  • Figure 1: Showcases of Z-Image-Turbo in photo-realistic image generation. All related prompts can be found in the appendix.
  • Figure 2: Showcases of Z-Image-Turbo in bilingual text-rendering. All related prompts can be found in the appendix.
  • Figure 3: Showcases of Z-Image-Edit in various image-to-image tasks. Each arrow represents an edit from the input image to the output image. All related prompts can be found in the appendix.
  • Figure 4: Comparison between Z-Image-Turbo and current state-of-the-art models [qin2025lumina, qwenimage, cao2025hunyuanimage, nanopro, flux-2-2025, seedream2025seedream, gao2025seedream, google2025imagen4]. Z-Image-Turbo shows conspicuous photo-realistic generation capacity.
  • Figure 5: Overview of the Active Curation Engine. The pipeline refines uncurated data through cross-modal embedding, deduplication, and rule-based filtering to construct a high-quality augmented dataset. A feedback mechanism leverages the Z-Image model to diagnose long-tail distribution deficiencies, dynamically guiding cross-modal retrieval to reinforce the data collection process. The "Squirrel Fish" (松鼠鳜鱼) case illustrates a classic long-tail challenge: it is the name of a Chinese dish, but the model lacks this specific concept and may fall back on compositional reasoning (combining "Squirrel" (松鼠) and "Fish" (鳜鱼)), leading to erroneous generations in the absence of domain-specific training data.
  • ...and 27 more figures
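
The Figure 5 caption names three refinement stages: cross-modal embedding, deduplication, and rule-based filtering. The sketch below is a minimal toy illustration of the last two stages, assuming embeddings have already been computed. It is not the authors' Active Curation Engine: the record fields, the similarity threshold, and the filter rules (`min_side`, `min_caption_words`) are all hypothetical, and the model-driven feedback loop is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def deduplicate(items, threshold=0.95):
    """Greedy near-duplicate removal: keep an item only if its embedding
    is below `threshold` similarity to every item kept so far."""
    kept = []
    for item in items:
        if all(cosine(item["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(item)
    return kept

def passes_rules(item, min_side=256, min_caption_words=3):
    """Hypothetical rule-based filter: minimum resolution and caption length."""
    big_enough = min(item["width"], item["height"]) >= min_side
    described = len(item["caption"].split()) >= min_caption_words
    return big_enough and described

def curate(items):
    """Deduplicate first, then apply the rule-based filter."""
    return [item for item in deduplicate(items) if passes_rules(item)]

# Toy records with made-up 2-D embeddings.
items = [
    {"id": "a", "embedding": [1.0, 0.0], "width": 1024, "height": 768,
     "caption": "a plate of squirrel fish"},
    {"id": "b", "embedding": [0.999, 0.01], "width": 1024, "height": 768,
     "caption": "squirrel fish on a plate"},       # near-duplicate of "a"
    {"id": "c", "embedding": [0.0, 1.0], "width": 128, "height": 128,
     "caption": "a tiny blurry thumbnail"},        # fails the size rule
    {"id": "d", "embedding": [0.6, 0.8], "width": 800, "height": 800,
     "caption": "a red fox in snow"},
]

curated_ids = [item["id"] for item in curate(items)]
print(curated_ids)  # ['a', 'd']
```

The greedy pass is quadratic in the number of kept items, which is fine for a toy example; a pipeline at dataset scale would more plausibly use an approximate nearest-neighbor index for the similarity lookups.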