1.58-bit FLUX

Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen

TL;DR

The paper tackles deploying high-quality text-to-image diffusion models on memory-limited devices by aggressively quantizing FLUX's vision transformer to 1.58 bits without relying on image data. It combines post-training quantization with a custom 1.58-bit kernel, constraining 99.5% of parameters to {+1,0,-1}. Empirical results on GenEval and T2I CompBench show comparable generation quality to full-precision FLUX while achieving 7.7× storage reduction and 5.1× inference-memory reduction, with latency benefits on common GPUs. The work demonstrates the practicality of extreme low-bit quantization for large T2I models and discusses current limitations and future work to further close gaps in speed and high-resolution fidelity.

Abstract

We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 × 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7× reduction in model storage, a 5.1× reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I CompBench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.
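
To make the weight format concrete, here is a minimal sketch of mapping weights to {-1, 0, +1} and storing them compactly. It rests on assumptions. The absmean rounding rule is borrowed from BitNet b1.58 for illustration, since the abstract does not state which rule 1.58-bit FLUX applies. The 2-bit, four-codes-per-byte layout is likewise only one plausible storage choice. The function names `ternarize`, `pack_2bit`, and `unpack_2bit` are hypothetical.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, +1} with an absmean scale.

    Assumption: this is the BitNet b1.58-style rule, used here only as
    an illustration; it is not confirmed as the rule used by 1.58-bit FLUX.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def pack_2bit(q: List[int]) -> bytes:
    """Pack ternary values as 2-bit codes, four per byte (assumed layout)."""
    codes = [v + 1 for v in q]           # {-1, 0, +1} -> {0, 1, 2}
    codes += [1] * (-len(codes) % 4)     # pad with the code for 0
    return bytes(
        codes[i] | (codes[i + 1] << 2) | (codes[i + 2] << 4) | (codes[i + 3] << 6)
        for i in range(0, len(codes), 4)
    )


def unpack_2bit(packed: bytes, n: int) -> List[int]:
    """Recover the first n ternary values from the packed bytes."""
    values = [((byte >> shift) & 0b11) - 1 for byte in packed for shift in (0, 2, 4, 6)]
    return values[:n]


if __name__ == "__main__":
    q, scale = ternarize([0.9, -0.05, -1.2, 0.4, 0.02, -0.7])
    packed = pack_2bit(q)
    print(q, round(scale, 3), len(packed), unpack_2bit(packed, len(q)) == q)
```

In this layout, six weights occupy two bytes instead of the twelve bytes they would take at 16 bits each. A production kernel would also need to apply the scale during the matrix multiply, which this sketch leaves out.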

Paper Structure

This paper contains 6 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Visual comparisons between FLUX and 1.58-bit FLUX. 1.58-bit FLUX demonstrates comparable generation quality to FLUX while employing 1.58-bit quantization, where 99.5% of the 11.9B parameters in the vision transformer are constrained to the values +1, -1, or 0. For consistency, all images in each comparison are generated using the same latent noise input. 1.58-bit FLUX utilizes a custom 1.58-bit kernel. Additional visual comparisons are provided in Fig. 3 and Fig. 4.
  • Figure 2: Efficiency measurements on the vision transformer component of FLUX and 1.58-bit FLUX. The measurements are based on generating a single image with 50 inference steps. (a) 1.58-bit FLUX reduces checkpoint storage by 7.7× compared to FLUX. (b) 1.58-bit FLUX achieves a 5.1× reduction in inference memory usage across various GPU types. The x-axis labels, $m$-$n$G, represent GPU type $m$ with a maximum memory capacity of $n$ Gigabytes (G).
  • Figure 3: Visual comparisons between FLUX and 1.58-bit FLUX on the GenEval dataset. 1.58-bit FLUX demonstrates comparable generation quality to FLUX while employing 1.58-bit quantization, where 99.5% of the 11.9B parameters in the vision transformer are constrained to the values +1, -1, or 0. For consistency, all images in each comparison are generated using the same latent noise input. 1.58-bit FLUX utilizes a custom 1.58-bit kernel.
  • Figure 4: Visual comparisons between FLUX and 1.58-bit FLUX on the validation split of T2I CompBench. 1.58-bit FLUX demonstrates comparable generation quality to FLUX while employing 1.58-bit quantization, where 99.5% of the 11.9B parameters in the vision transformer are constrained to the values +1, -1, or 0. For consistency, all images in each comparison are generated using the same latent noise input. 1.58-bit FLUX utilizes a custom 1.58-bit kernel.
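
The storage figure in the Figure 2 caption can be sanity-checked with a back-of-envelope calculation. The sketch below takes the 11.9B parameter count and the 99.5% ternary fraction from the captions above. The 16-bit baseline, the 16-bit storage of the remaining 0.5% of parameters, and the 2 bits per packed ternary weight are assumptions made here for illustration. It is not a description of how the authors measured checkpoint size.

```python
import math

PARAMS = 11.9e9            # vision transformer parameters (from the captions)
TERNARY_FRACTION = 0.995   # share constrained to {-1, 0, +1} (from the captions)
BASELINE_BITS = 16         # assumption: 16-bit full-precision checkpoint
PACKED_BITS = 2            # assumption: each ternary weight stored in 2 bits

# A three-valued symbol carries log2(3) bits of information, hence "1.58-bit".
IDEAL_TERNARY_BITS = math.log2(3)


def average_bits(ternary_bits: float) -> float:
    """Average bits per parameter when the non-ternary rest stays at 16 bits."""
    return TERNARY_FRACTION * ternary_bits + (1 - TERNARY_FRACTION) * BASELINE_BITS


def storage_reduction(ternary_bits: float) -> float:
    """Ratio of baseline checkpoint size to quantized checkpoint size."""
    return BASELINE_BITS / average_bits(ternary_bits)


def size_gb(bits_per_param: float) -> float:
    """Checkpoint size in gigabytes (10^9 bytes) for the given average width."""
    return PARAMS * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    print(f"bits per ternary symbol: {IDEAL_TERNARY_BITS:.3f}")
    print(f"baseline size: {size_gb(BASELINE_BITS):.1f} GB")
    print(f"2-bit packed size: {size_gb(average_bits(PACKED_BITS)):.2f} GB")
    print(f"reduction with 2-bit packing: {storage_reduction(PACKED_BITS):.2f}x")
```

Under these assumptions the average width is about 2.07 bits per parameter, which gives a reduction of roughly 7.7×, consistent with the reported storage figure. The 5.1× inference-memory reduction cannot be reproduced this way, because inference memory also includes activations and other buffers that the captions do not break down.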