ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers

Feice Huang, Zuliang Han, Xing Zhou, Yihuang Chen, Lifei Zhu, Haoqian Wang

TL;DR

The paper tackles the memory footprint and inference latency of diffusion transformers with ConvRot, a group-wise rotation built on the regular Hadamard transform (RHT) that suppresses both row-wise and column-wise outliers. Combined with the ConvLinear4bit module, this enables training-free W4A4 inference. The authors give a theoretical foundation for regular Hadamard matrices, including a Kronecker construction, and report a 2.26$\times$ speedup and 4.05$\times$ memory reduction on FLUX.1-dev while maintaining image fidelity. The work positions rotation-based quantization as a practical tool for deploying large diffusion models on commodity hardware.

Abstract

Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages the regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining while preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26$\times$ speedup and 4.05$\times$ memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
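
The rotate, quantize, GEMM, dequantize pipeline named in the abstract can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration and not the paper's kernel: the group size of 4, the particular matrix `H4`, symmetric round-to-nearest quantization to [-7, 7], one scale per vector, and the helper names `rotate_groups`, `quantize_int4`, and `linear_w4a4`. Weights are rotated on the fly only for brevity; a deployed module would presumably rotate them once offline.

```python
import math

# A 4x4 regular Hadamard matrix: rows are mutually orthogonal and every
# row and column sums to 2 = sqrt(4). (Illustrative choice.)
H4 = [
    [ 1,  1,  1, -1],
    [ 1,  1, -1,  1],
    [ 1, -1,  1,  1],
    [-1,  1,  1,  1],
]

def rotate_groups(vec, h=H4):
    """Multiply each contiguous group of len(h) entries by h / sqrt(len(h))."""
    g = len(h)
    assert len(vec) % g == 0, "length must be a multiple of the group size"
    scale = 1.0 / math.sqrt(g)
    out = []
    for start in range(0, len(vec), g):
        block = vec[start:start + g]
        out.extend(scale * sum(block[i] * h[i][j] for i in range(g))
                   for j in range(g))
    return out

def quantize_int4(values):
    """Symmetric round-to-nearest quantization to integers in [-7, 7]."""
    peak = max(abs(v) for v in values)
    scale = peak / 7.0 if peak > 0 else 1.0
    return [max(-7, min(7, round(v / scale))) for v in values], scale

def linear_w4a4(x, weight_cols):
    """Approximate x @ W with rotated, 4-bit-quantized operands.

    weight_cols holds one list per output feature (a column of W).
    """
    qx, sx = quantize_int4(rotate_groups(x))
    outputs = []
    for col in weight_cols:
        qw, sw = quantize_int4(rotate_groups(col))
        acc = sum(a * b for a, b in zip(qx, qw))  # integer dot product
        outputs.append(acc * sx * sw)             # dequantize with both scales
    return outputs

# Toy activation with one large channel, and a single weight column.
x = [0.5, -1.0, 12.0, 0.25, 1.5, 0.75, -0.5, 2.0]
w = [0.1, -0.2, 0.3, 0.4, -0.1, 0.2, 0.05, -0.3]
reference = sum(a * b for a, b in zip(x, w))
approx = linear_w4a4(x, [w])[0]
print(reference, approx)
```

Because the same orthogonal block rotation is applied to the activation and to each weight column, the full-precision product is unchanged; only the quantization error differs. Each group costs a fixed amount of work, so the rotation cost grows linearly with the channel count.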

Paper Structure

This paper contains 24 sections, 3 theorems, 21 equations, 8 figures, 8 tables.

Key Result

Theorem 2.1

For an $\mathcal{H}$-Matrix $\mathbf{H}_n$,
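
The statement is cut off here. Judging only by the theorem's title (Column Sum Squared Property), it plausibly refers to the following standard identity; this reading is an assumption, not the paper's wording. Let $\mathbf{s} = \mathbf{H}_n^\top \mathbf{1}$ be the vector of column sums of $\mathbf{H}_n$. Since $\mathbf{H}_n \mathbf{H}_n^\top = n\,\mathbf{I}_n$,

$$\sum_{j=1}^{n} s_j^2 = \mathbf{1}^\top \mathbf{H}_n \mathbf{H}_n^\top \mathbf{1} = n\,\mathbf{1}^\top \mathbf{1} = n^2 .$$

The total squared column sum is therefore fixed at $n^2$ for every Hadamard matrix of order $n$; matrices differ only in how that total is distributed across columns.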

Figures (8)

  • Figure 1: Rotation-based quantization methods using Hadamard matrices can effectively suppress outliers by redistributing energy across channels. However, Sylvester-type Hadamard matrices lead to energy concentration when encountering row-wise outliers.
  • Figure 2: Effect of Hadamard transforms on the single_transformer_blocks.37.proj_out activations in Flux. The standard transform amplifies outliers (max $=106.19$), while the group-wise regular transform suppresses them (max $=9.26$), compared to the original (max $=14.48$).
  • Figure 3: Overview of ConvRot. Left: ConvLinear4bit serves as a plug-and-play replacement for Linear layers. Right: ConvRot applies Regular Hadamard Transform (RHT) on non-overlapping sliding windows of the activation tensor, with each window multiplied by a regular Hadamard matrix.
  • Figure 4: Visual comparison of our method using different rotation sizes on the MJHQ-30K dataset. Prompts cover diverse themes including food, human portraits, animals, landscapes, indoor scenes, and figurines.
  • Figure 5: Qualitative impact of hybrid-precision inference on sDCI.
  • ...and 3 more figures
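
The contrast described in Figures 1 and 2 can be reproduced on a toy vector. The sketch below rests on an assumption: a row-wise outlier is modeled as a token whose channels all carry the same large value, and a column-wise outlier as a single large channel. The 4x4 matrices and the helper names `rotate` and `max_abs` are illustrative and not taken from the paper.

```python
import math

# Sylvester-type Hadamard matrix: the first column is all ones,
# so its column sums are [4, 0, 0, 0].
SYLVESTER_4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

# Regular Hadamard matrix: every column sums to 2 = sqrt(4).
REGULAR_4 = [
    [ 1,  1,  1, -1],
    [ 1,  1, -1,  1],
    [ 1, -1,  1,  1],
    [-1,  1,  1,  1],
]

def rotate(x, h):
    """Return x @ (h / sqrt(n)) for a length-n vector x."""
    n = len(h)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x[i] * h[i][j] for i in range(n)) for j in range(n)]

def max_abs(v):
    return max(abs(a) for a in v)

row_outlier = [10.0, 10.0, 10.0, 10.0]   # all channels large (assumed model)
column_outlier = [10.0, 0.0, 0.0, 0.0]   # one channel large

for name, h in (("sylvester", SYLVESTER_4), ("regular", REGULAR_4)):
    print(name,
          max_abs(rotate(row_outlier, h)),
          max_abs(rotate(column_outlier, h)))
```

Under this model the Sylvester matrix piles the whole row into one output channel, while the regular matrix keeps the magnitude spread evenly. Both matrices flatten the single-channel spike equally.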

Theorems & Definitions (9)

  • Definition 2.1: Hadamard Matrix ($\mathcal{H}$-Matrix)
  • Theorem 2.1: Column Sum Squared Property
  • Definition 2.2: Column Discrepancy
  • Definition 2.3: Regular $\mathcal{H}$-Matrix
  • Theorem 2.2
  • Theorem 2.3: Kronecker Construction
  • proof
  • proof
  • proof
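
The Kronecker construction of Theorem 2.3 can be checked numerically on small cases. The sketch below assumes the usual meaning of "regular" (all row and column sums are equal) and relies on the standard fact that the Kronecker product of two regular Hadamard matrices is again a regular Hadamard matrix; the paper's exact statement is not reproduced in this listing. The helper names are illustrative.

```python
def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    m = len(b)
    n = len(a) * m
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n)]
            for i in range(n)]

def column_sums(h):
    return [sum(row[j] for row in h) for j in range(len(h))]

def is_hadamard(h):
    """Entries are +1/-1 and distinct rows are orthogonal (H H^T = n I)."""
    n = len(h)
    entries_ok = all(v in (1, -1) for row in h for v in row)
    rows_ok = all(
        sum(h[i][k] * h[j][k] for k in range(n)) == (n if i == j else 0)
        for i in range(n) for j in range(n)
    )
    return entries_ok and rows_ok

def is_regular(h):
    """All row sums and all column sums take one common value."""
    sums = [sum(row) for row in h] + column_sums(h)
    return all(s == sums[0] for s in sums)

REGULAR_4 = [
    [ 1,  1,  1, -1],
    [ 1,  1, -1,  1],
    [ 1, -1,  1,  1],
    [-1,  1,  1,  1],
]
SYLVESTER_2 = [[1, 1], [1, -1]]

regular_16 = kron(REGULAR_4, REGULAR_4)      # order 16, built from order 4
sylvester_4 = kron(SYLVESTER_2, SYLVESTER_2)  # Sylvester matrix of order 4

print(is_hadamard(regular_16), is_regular(regular_16))
print(column_sums(sylvester_4))
```

Both matrices are Hadamard, and their squared column sums total $n^2$, but only the regular one spreads that total evenly over the columns.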