SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S. -H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren

TL;DR

SnapGen addresses the challenge of deploying high-resolution T2I diffusion on mobile by combining an extremely compact UNet and a tiny decoder with a comprehensive training pipeline that includes multi-level knowledge distillation and adversarial step distillation. The approach enables $1024^2$ px image generation on-device in about $1.4$ seconds, with a parameter count under $0.38$B and FID and GenEval scores competitive with multi-billion-parameter models. Key contributions include an efficient UNet architecture, a streamlined decoder, flow-based training with timestep-aware distillation, and few-step generation enabled by diffusion-GAN guidance. This work improves on-device generation quality and speed, making on-device T2I practical with strong prompt-following and realism while reducing bandwidth and cloud costs.

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).

Paper Structure

This paper contains 19 sections, 9 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Efficient UNet. Starting from a thinner and shorter version of the UNet from SDXL (as in (a)), we explore a series of architectural changes, i.e., (b)–(f), to develop a smaller and faster model while retaining high-quality generation performance, as evaluated in the UNet ablation figure (fig:unet_ablation).
  • Figure 1: Demo on iPhone 16 Pro-Max. We report the forward time for each 4-step generation, excluding the model loading time.
  • Figure 2: Comparisons of Performance and Efficiency for various Design Choices of Efficient UNet. The generation quality is evaluated using FID calculated on ImageNet-1K for $256^2$ px generation. The efficiency metrics include model parameters, latency, and FLOPs. FLOPs and latency (on iPhone 15 Pro) are measured with a $128\times128$ latent, equivalent to a $1024\times1024$ decoded image, for one forward pass. We show the architecture enhancements that improve any of the metrics without hurting others.
  • Figure 2: Comparisons of Decoder Reconstruction between SD3 decoder and our tiny decoder. Zoom in for better viewing.
  • Figure 3: Comparisons of Decoder Architecture between (a) SDXL/SD3 decoder and (b) our tiny decoder.
  • ...and 8 more figures
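
The UNet ablation caption above states that a $128\times128$ latent corresponds to a $1024\times1024$ decoded image. As a minimal sketch of that arithmetic (the function name `decoder_upsampling_factor` is hypothetical and not taken from the paper), the implied spatial upsampling factor of the decoder can be checked as follows:

```python
def decoder_upsampling_factor(latent_side: int, image_side: int) -> int:
    """Return the per-side upsampling factor implied by a latent/image size pair.

    Hypothetical helper for illustration only; it assumes square inputs and
    an integer ratio between image side and latent side.
    """
    if latent_side <= 0 or image_side % latent_side != 0:
        raise ValueError("image side must be a positive integer multiple of latent side")
    return image_side // latent_side


# Sizes quoted in the caption: 128x128 latent, 1024x1024 decoded image.
factor = decoder_upsampling_factor(128, 1024)

# Each latent position therefore covers a factor x factor patch of pixels.
pixels_per_latent_position = factor * factor

print(factor, pixels_per_latent_position)
```

Under these numbers the factor is 8 per side, so the UNet operates on 64 times fewer spatial positions than the decoded image has pixels, which is why latency and FLOPs in the caption are reported at the latent resolution.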