TreeQ: Pushing the Quantization Boundary of Diffusion Transformer via Tree-Structured Mixed-Precision Search

Kaicheng Yang, Kaisen Yang, Baiting Wu, Xun Zhang, Qianrui Yang, Haotong Qin, He Zhang, Yulun Zhang

TL;DR

TreeQ tackles the practical challenge of deploying diffusion transformers under ultra-low-bit quantization. It introduces three components: Tree-Structured Search (TSS) for topology-aware, efficient mixed-precision exploration; Environmental Noise Guidance (ENG) to unify PTQ and QAT objectives with a single hyperparameter; and General Monarch Branch (GMB) to recover high-frequency details via a structured sparse, hardware-friendly decomposition. Together, these yield state-of-the-art 4-bit PTQ results on DiT-XL/2 (near lossless) and robust gains under QLoRA PEFT, while maintaining controllable search complexity. The work demonstrates that careful integration of topology-aware search, objective alignment, and high-frequency-preserving sparsity can significantly advance the practicality of low-bit diffusion models for real-world deployment.

Abstract

Diffusion Transformers (DiTs) have emerged as a highly scalable and effective backbone for image generation, outperforming U-Net architectures in both scalability and performance. However, their real-world deployment remains challenging due to high computational and memory demands. Mixed-Precision Quantization (MPQ), designed to push the limits of quantization, has demonstrated remarkable success in advancing U-Net quantization to sub-4-bit settings while significantly reducing computational and memory overhead. Nevertheless, its application to DiT architectures remains limited and underexplored. In this work, we propose TreeQ, a unified framework addressing key challenges in DiT quantization. First, to tackle inefficient search and proxy misalignment, we introduce Tree-Structured Search (TSS). This DiT-specific approach leverages the architecture's linear properties to traverse the solution space in O(n) time while improving objective accuracy through comparison-based pruning. Second, to unify optimization objectives, we propose Environmental Noise Guidance (ENG), which aligns Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) configurations using a single hyperparameter. Third, to mitigate information bottlenecks in ultra-low-bit regimes, we design the General Monarch Branch (GMB). This structured sparse branch prevents irreversible information loss, enabling finer detail generation. Through extensive experiments, our TreeQ framework demonstrates state-of-the-art performance on DiT-XL/2 under W3A3 and W4A4 PTQ/PEFT settings. Notably, our work is the first to achieve near-lossless 4-bit PTQ performance on DiT models. The code and models will be available at https://github.com/racoonykc/TreeQ
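To make the bit-width trade-off behind mixed-precision quantization concrete, the following is a minimal sketch of symmetric uniform "fake" quantization (quantize, then dequantize) applied to synthetic Gaussian weights. It is an illustration only, not the quantizer used in TreeQ: the function names `fake_quantize` and `mse`, the per-tensor max-abs scale, and the synthetic data are all assumptions made for this example.

```python
import random


def fake_quantize(values, bits):
    """Symmetric uniform quantize-dequantize with one per-tensor scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1            # largest integer code, e.g. 3 for 3-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)               # nothing to quantize
    scale = max_abs / qmax                # step size between adjacent levels
    out = []
    for v in values:
        q = round(v / scale)              # nearest integer code
        q = max(-qmax, min(qmax, q))      # clamp to the symmetric range
        out.append(q * scale)             # map back to floating point
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor (hypothetical data, fixed seed).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Reconstruction error for several candidate bit-widths.
errors = {bits: mse(weights, fake_quantize(weights, bits)) for bits in (2, 3, 4, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.5f}")
```

In this toy setting the error shrinks as the bit-width grows. A mixed-precision search exploits that trade-off by spending bits where they reduce error most under a fixed budget.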

Paper Structure

This paper contains 21 sections, 18 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Visual comparison on DiT-XL/2 under low-bit PTQ. TreeQ achieves better generation quality than the baseline (yang2025robuq).
  • Figure 2: Analysis of DiT's local coupling properties. (a) U-Net's skip connections create long-range dependencies, whereas DiT exhibits a simple linear structure of block connections. (b) Relative L2 loss versus distance when quantizing the first block's attn.qkv layer to 4-bit. This inspires our tree-structured merging algorithm, which progressively aggregates local segments into global configurations while ensuring evaluation of each adjacency relationship.
  • Figure 3: Visual comparison between TSS and traditional Integer Programming methods. (left) Integer programming methods typically define a layer-wise heuristic function and set the optimization goal as minimizing the sum of these functions. This implicitly assumes a linear relationship between the heuristic and final performance, which introduces an error into the optimization objective. (right) TSS leverages DiT's local coupling and merges the search from local to global. This naturally accounts for the interactions between multiple layers and allocates more resources to strongly coupled adjacent layers. With a comparison-based Pareto queue pruning strategy, the chosen objective function (e.g., MSE) only needs to be strongly order-preserving with respect to the final performance, which corrects the optimization objective.
  • Figure 4: Visualization of ENG at 3-bit. We observe that the same mixed-precision configuration performs differently under PTQ and QAT, with some configurations favoring PTQ and others QAT. The environmental noise parameter $e$ is a highly sensitive hyperparameter that guides the search process. With excessive noise ($e$ = 2-bit, below the target), the search collapses. Setting $e$ = 3-bit (matching the target) yields a configuration optimal for PTQ, whereas $e$ = 32-bit (no noise) finds a configuration that adapts better to QAT.
  • Figure 5: Visualization of GMB. GMB can initialize a structured sparse matrix with arbitrary sparsity from a pre-trained weight of any shape. The block-diagonal form of the sub-matrices facilitates efficient parallel processing on modern GPUs.
  • ...and 1 more figure
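The local-to-global merging with comparison-based Pareto pruning described in the captions for Figures 2 and 3 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_error` is a hypothetical proxy with a neighbour-coupling term, cost is simply the sum of bit-widths, and the names `pareto_prune`, `merge`, `tree_search`, and `max_keep` are invented for this example.

```python
from itertools import product


def pareto_prune(candidates, max_keep):
    """Keep (cost, error, config) entries not dominated in cost/error; cap the queue."""
    ordered = sorted(candidates, key=lambda c: (c[0], c[1]))
    front, best_err = [], float("inf")
    for cost, err, cfg in ordered:
        if err < best_err:                # beats every cheaper-or-equal candidate
            front.append((cost, err, cfg))
            best_err = err
    if len(front) <= max_keep:
        return front
    # Evenly subsample the front, keeping both endpoints (requires max_keep >= 2).
    step = (len(front) - 1) / (max_keep - 1)
    return [front[round(i * step)] for i in range(max_keep)]


def toy_error(cfg):
    """Hypothetical proxy: per-layer noise plus a coupling term between neighbours."""
    noise = [2.0 ** (-b) for b in cfg]
    local = sum(n * n for n in noise)
    coupled = sum(a * b for a, b in zip(noise, noise[1:]))
    return local + coupled


def merge(left, right, evaluate, max_keep):
    """Combine two adjacent segments and re-evaluate every joined configuration."""
    merged = []
    for (_, _, cfg_l), (_, _, cfg_r) in product(left, right):
        cfg = cfg_l + cfg_r               # concatenation keeps layers in order
        merged.append((sum(cfg), evaluate(cfg), cfg))
    return pareto_prune(merged, max_keep)


def tree_search(num_layers, bit_choices, evaluate, max_keep=8):
    """Merge neighbouring segments pairwise until one global queue remains."""
    segments = [
        pareto_prune([(b, evaluate((b,)), (b,)) for b in bit_choices], max_keep)
        for _ in range(num_layers)
    ]
    while len(segments) > 1:
        merged = [
            merge(segments[i], segments[i + 1], evaluate, max_keep)
            for i in range(0, len(segments) - 1, 2)
        ]
        if len(segments) % 2:             # odd segment is carried to the next level
            merged.append(segments[-1])
        segments = merged
    return segments[0]


front = tree_search(num_layers=8, bit_choices=(2, 3, 4), evaluate=toy_error)
for cost, err, cfg in front:
    print(cost, round(err, 4), cfg)
```

In this sketch, n layers require n − 1 merges, and each merge evaluates at most `max_keep` squared joined configurations, so the number of evaluations grows linearly in n for a fixed queue size. Because pruning relies only on comparisons between candidates, the proxy needs to rank configurations correctly rather than predict final quality on an absolute scale.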