NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

Ruchika Chavhan, Malcolm Chadwick, Alberto Gil Couto Pimentel Ramos, Luca Morreale, Mehdi Noroozi, Abhinav Mehrotra

TL;DR

NanoFLUX targets high-quality on-device text-to-image generation by distilling the 17B FLUX.1-Schnell teacher into a compact 2.4B model. Its progressive pipeline prunes diffusion transformer heads, merges blocks, and replaces AdaLN with Static-LN. Progressive Token Downsampling further reduces latency, and the T5-XXL text encoder is shrunk to 330M through block-wise distillation that leverages visual signals from early denoising stages. The resulting model generates 512×512 images in about 2.5 seconds on mobile devices with quality comparable to larger baselines, making on-device generation more accessible and private.

Abstract

While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) a model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) a ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) a novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512×512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.
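
To make the latency argument behind contribution (2) concrete, here is a minimal NumPy sketch. It is not the paper's method: a 2×2 average pool stands in for the learned ResNet-based downsampler, nearest-neighbour repetition stands in for the upsampler, and the 32×32 token grid, the channel width, and the function names `downsample_tokens` and `upsample_tokens` are hypothetical choices made for illustration. The sketch only shows that halving each spatial side cuts the token count by 4× and the number of pairwise attention interactions by 16×.

```python
import numpy as np

def downsample_tokens(tokens: np.ndarray, h: int, w: int, factor: int = 2) -> np.ndarray:
    """Average-pool an (h*w, c) token sequence over factor x factor windows.

    Assumption: pooling is a stand-in for a learned ResNet-based downsampler.
    """
    n, c = tokens.shape
    assert n == h * w and h % factor == 0 and w % factor == 0
    # Split rows and columns of the token grid into (block index, offset) pairs.
    grid = tokens.reshape(h // factor, factor, w // factor, factor, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c)

def upsample_tokens(tokens: np.ndarray, h: int, w: int, factor: int = 2) -> np.ndarray:
    """Restore an (h*w, c) sequence by repeating each low-resolution token.

    Assumption: nearest-neighbour repetition is a stand-in for a learned upsampler.
    """
    c = tokens.shape[-1]
    grid = tokens.reshape(h // factor, w // factor, c)
    grid = grid.repeat(factor, axis=0).repeat(factor, axis=1)
    return grid.reshape(h * w, c)

# Hypothetical token grid; the source does not state the token count.
h, w, c = 32, 32, 8
rng = np.random.default_rng(0)
tokens = rng.standard_normal((h * w, c))

low_res = downsample_tokens(tokens, h, w)     # tokens seen by intermediate blocks
restored = upsample_tokens(low_res, h, w)     # back to the full sequence length

n_full, n_low = tokens.shape[0], low_res.shape[0]
attention_pair_ratio = (n_full ** 2) // (n_low ** 2)  # self-attention scales with n^2
print(n_full, n_low, attention_pair_ratio)
```

Blocks that run on `low_res` attend over a quarter of the tokens, which is where the latency saving would come from; blocks outside the downsampled span still see the full-resolution sequence.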

Paper Structure (24 sections, 5 equations, 9 figures, 14 tables, 1 algorithm)

Figures (9)

  • Figure 1: Overview of the teacher 17B FLUX.1-Schnell (left) and 2.4B NanoFLUX (right).
  • Figure 2: Overview of Progressive Token Downsampling. (a–b) The ResNet-based downsampler reduces token length and the upsampler restores it. (c) Progressive training enables blocks to operate on downsampled tokens incrementally. Here, $U$ includes the transformer’s output projection layers for clarity.
  • Figure 3: Attention head redundancy analysis. Low-rank reconstructions of per-token attention outputs $\text{softmax}(QK^\top)V$ using the top $r$ singular components show that $r=16$ (out of $H=24$ heads) preserves image quality in FLUX.1-Schnell (12B), indicating substantial redundancy across attention heads.
  • Figure 4: Analysing redundancy in features. Low-rank reconstructions of per-head attention outputs $\text{softmax}(QK^\top)V$ using the top $r$ singular components show that $r=96$ preserves image quality in our 5B model, indicating substantial redundancy across features.
  • Figure 5: Cosine similarity between input and output of transformer blocks in the 3B model. We observe a sequence of Single-Stream blocks (7-23) that exhibit high similarity, indicating potential redundancy.
  • ...and 4 more figures
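
The redundancy analyses described in the captions of Figures 3 to 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the data is synthetic, the per-token attention output is assumed to be an $H \times d$ matrix with a hypothetical head width $d=128$, and the helper names `low_rank_reconstruction`, `relative_error`, and `mean_token_cosine` are made up for this example. The first part keeps the top $r$ singular components of a matrix, and the second measures the cosine similarity between a block's input and output tokens.

```python
import numpy as np

def low_rank_reconstruction(x: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r approximation of a 2-D matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r, :]

def relative_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Frobenius-norm reconstruction error relative to the original."""
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

def mean_token_cosine(x_in: np.ndarray, x_out: np.ndarray) -> float:
    """Mean per-token cosine similarity between block input and output."""
    num = np.sum(x_in * x_out, axis=-1)
    den = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)

# Hypothetical per-token attention output: H heads, d features per head.
# The synthetic matrix is built to have rank 16 plus small noise, so that
# 16 of the 24 singular components capture almost everything.
H, d = 24, 128
head_outputs = rng.standard_normal((H, 16)) @ rng.standard_normal((16, d))
head_outputs += 1e-3 * rng.standard_normal((H, d))

err_r16 = relative_error(head_outputs, low_rank_reconstruction(head_outputs, 16))
err_r4 = relative_error(head_outputs, low_rank_reconstruction(head_outputs, 4))

# Hypothetical residual block that barely changes its input: a high
# similarity like this is the signal used to flag a block as redundant.
tokens_in = rng.standard_normal((64, d))
tokens_out = tokens_in + 0.01 * rng.standard_normal((64, d))
similarity = mean_token_cosine(tokens_in, tokens_out)

print(err_r16, err_r4, similarity)
```

On real activations, a small error at low $r$ would suggest that heads or features can be pruned, and a run of blocks with similarity close to 1 would suggest candidates for merging or removal.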