CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Songhua Liu, Zhenxiong Tan, Xinchao Wang

TL;DR

This work tackles the latency bottleneck of quadratic attention in diffusion transformers by introducing CLEAR, a convolution-like local attention with a circular window that achieves linear complexity. CLEAR hinges on four design principles—locality, formulation consistency, high-rank attention maps, and feature integrity—and enables effective distillation from a pre-trained DiT to a linearized student using only 10K self-generated samples. The resulting model attains near-teacher performance with a 99.5% reduction in attention computations and a 6.3× speedup for 8K image generation, while preserving cross-resolution generalization and plugin compatibility. Despite notable gains, the authors acknowledge a gap between practical acceleration and theoretical FLOPS at low resolutions, suggesting future work on fused CUDA operators tailored to CLEAR's sparse pattern.

Abstract

Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we seek an attention mechanism that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. At the same time, the student reduces attention computations by 99.5% and accelerates generation by 6.3 times when generating 8K-resolution images. Furthermore, we investigate favorable properties of the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and code are available here: https://github.com/Huage001/CLEAR.
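To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of a circular local-window mask over a grid of image tokens. It is not the authors' implementation: the function names, the strict `<` distance comparison, and the dense boolean mask are all assumptions chosen for illustration, and text tokens are left out entirely.

```python
import numpy as np


def circular_window_mask(height, width, radius):
    """Return a boolean (H*W, H*W) mask over a token grid.

    Entry [q, k] is True when key token k lies strictly within Euclidean
    distance `radius` of query token q. The strict comparison is an
    assumption of this sketch, not something stated in the abstract.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) grid positions
    diff = coords[:, None, :] - coords[None, :, :]        # (N, N, 2) pairwise offsets
    dist_sq = (diff ** 2).sum(axis=-1)                    # squared pairwise distances
    return dist_sq < radius ** 2


def max_keys_per_query(height, width, radius):
    """Largest number of keys any single query attends to under the mask."""
    mask = circular_window_mask(height, width, radius)
    return int(mask.sum(axis=1).max())


if __name__ == "__main__":
    # The per-query key count depends on the radius, not on the grid size,
    # so the total number of query-key pairs grows linearly with token count.
    for side in (8, 16, 32):
        print(side * side, "tokens ->", max_keys_per_query(side, side, 2), "keys per query at most")
```

Because the window holds a fixed number of keys for a given radius, enlarging the grid adds queries but does not enlarge any individual query's neighbourhood. A dense mask like this one still costs quadratic memory to build, so it only illustrates the sparsity pattern and is not an efficient kernel.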

Paper Structure

This paper contains 22 sections, 16 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: Comparison of speed and GFLOPS between the proposed linearized DiT and the original FLUX.1-dev. Speed is evaluated by performing 20 denoising steps on a single H100 GPU. FLOPS is calculated with the approximation: $4\times\sum M\times c$, where $c$ is the feature dimension and $M$ denotes the attention masks. $\log_2$ is applied on both vertical axes for better visualization. The raw data are supplemented in the appendix.
  • Figure 2: Preliminary results of various efficient attention methods on FLUX.1-dev. The prompt is "A small blue plane sitting on top of a field".
  • Figure 3: Visualization of attention maps by various heads for an intermediate denoising step. Attention in pre-trained DiTs is largely conducted in a local fashion.
  • Figure 4: We perturb remote and local features respectively by clipping the relative distances required for rotary position embedding. Perturbing remote features has no obvious impact on image quality, whereas altering local features results in significant distortion. The text prompt and the original generation result are consistent with Fig. 3.
  • Figure 5: Illustration of the proposed convolution-like linearization strategy for pre-trained DiTs. In each text-image joint attention module, text queries aggregate information from all text and image tokens, while each image token gathers information only from tokens within a local circular window.
  • ...and 10 more figures
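
The FLOPS approximation quoted in the Figure 1 caption, $4\times\sum M\times c$, can be sketched in a few lines of Python. This is an illustrative calculation only: the helper names, the grid sizes, the radius, and the feature dimension are hypothetical, the strict `<` window test is an assumption, text tokens are ignored, and the local count is an upper bound because windows clipped at the image border are counted as full.

```python
import math


def window_size(radius):
    """Count integer grid offsets (dy, dx) strictly inside a circle of `radius`."""
    r = int(math.ceil(radius))
    return sum(
        1
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
        if dy * dy + dx * dx < radius * radius
    )


def attention_flops(mask_entries, feature_dim):
    """Apply the caption's approximation: 4 * (number of unmasked pairs) * c."""
    return 4 * mask_entries * feature_dim


def full_vs_local(height, width, radius, feature_dim):
    """Return (full-attention FLOPs, local-window FLOPs upper bound)."""
    n = height * width
    full = attention_flops(n * n, feature_dim)
    # Upper bound: every query is assumed to see a complete window.
    local = attention_flops(n * min(window_size(radius), n), feature_dim)
    return full, local


if __name__ == "__main__":
    for side in (4, 8, 16):
        full, local = full_vs_local(side, side, radius=2, feature_dim=8)
        print(f"{side}x{side} grid: full={full}, local<={local}")
```

Under these assumptions, doubling the side of the grid multiplies the full-attention estimate by 16 and the local-window estimate by 4, which is the quadratic versus linear scaling that Figure 1 plots.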