LinFusion: 1 GPU, 1 Minute, 16K Image

Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang

TL;DR

This work tackles the quadratic time and memory cost of self-attention in diffusion models, which is the main bottleneck for high-resolution image generation. It introduces LinFusion, a generalized linear attention framework that builds on a normalization-aware, non-causal variant of Mamba2 and is trained via knowledge distillation from Stable Diffusion (SD), which preserves compatibility with the original model while reducing cost. Experiments on SD-v1.5, SD-v2.1, and SD-XL show performance on par with or better than the original models at substantially lower time and memory cost, enabling ultra-high-resolution generation (up to 16K on a single GPU) and integration with common SD components such as ControlNet and IP-Adapter without additional adaptation. The approach offers a practical path to scalable, high-fidelity text-to-image generation on a single GPU.

Abstract

Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this paradigm faces significant challenges in generating high-resolution visual content because its time and memory complexity is quadratic in the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce the training cost and better leverage pre-trained models, we initialize our models from pre-trained Stable Diffusion (SD) and distill its knowledge. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-high-resolution images such as 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, and DistriFusion, requiring no adaptation effort. Code is available at https://github.com/Huage001/LinFusion.
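
To make the complexity claim concrete, the following is a minimal NumPy sketch of normalized, non-causal linear attention next to its explicit quadratic counterpart. It is not the LinFusion module itself: the `feature_map` (elu(x) + 1) is a common choice from the linear-attention literature and is an assumption here, as are the function names and the single-head, unbatched layout. The point is only that both functions compute the same output, while the second never forms the n-by-n matrix.

```python
import numpy as np

def feature_map(x):
    # Assumed positive kernel feature map: elu(x) + 1.
    # This is an illustrative choice, not LinFusion's learned mapping.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(q, k, v):
    """Reference form: materializes the full (n, n) weight matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                          # (n, n)
    weights = scores / scores.sum(axis=1, keepdims=True)  # rows sum to 1
    return weights @ v                                # (n, d_v)

def linear_attention(q, k, v):
    """Same result, reordered so cost grows linearly in the token count n."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v             # (d, d_v): summary over ALL tokens (non-causal)
    k_sum = phi_k.sum(axis=0)    # (d,): used for the normalization term
    numer = phi_q @ kv           # (n, d_v)
    denom = phi_q @ k_sum        # (n,)
    return numer / denom[:, None]

# Usage with hypothetical sizes: 16 tokens, head dim 4, value dim 5.
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4))
k = rng.standard_normal((16, 4))
v = rng.standard_normal((16, 5))
print(np.abs(linear_attention(q, k, v) - quadratic_attention(q, k, v)).max())
```

Every token here aggregates information from all other tokens (no causal mask), and the division by `denom` makes each token's implicit attention weights sum to one, which corresponds to the two features the abstract highlights.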

Paper Structure

This paper contains 18 sections, 6 theorems, 8 equations, 15 figures, and 4 tables.

Key Result

Proposition 1

Assuming that the mean of the $j$-th channel in the input feature map $X$ is $\mu_j$, and denoting $(CB^\top)\odot\tilde{A}$ as $M$, the mean of this channel at the $i$-th token of the output feature map $Y$ is $\mu_j\sum_{k=1}^nM_{ik}$, where $n$ is the number of tokens.
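
The proposition follows from linearity of expectation applied to $Y = MX$, and a small numerical check can make it tangible. The sketch below is illustrative only: the sizes, the non-negative random $C$ and $B$, and the geometric-decay causal mask used for $\tilde{A}$ are all assumptions chosen for the demonstration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, state = 6, 3, 4   # tokens, channels, state size (all hypothetical)

# Non-negative C and B so that every row sum of M is positive (assumption).
C = np.abs(rng.standard_normal((n, state)))
B = np.abs(rng.standard_normal((n, state)))

# Hypothetical causal mask with geometric decay: A_tilde[i, k] = a**(i - k) for k <= i.
a = 0.9
idx = np.arange(n)
A_tilde = np.tril(a ** (idx[:, None] - idx[None, :]))

M = (C @ B.T) * A_tilde             # M = (C B^T) ⊙ A_tilde

mu = np.array([0.5, -1.0, 2.0])     # per-channel input means mu_j
X_mean = np.tile(mu, (n, 1))        # E[X]: every token shares the channel means
Y_mean = M @ X_mean                 # E[Y] = M E[X] by linearity

row_sums = M.sum(axis=1)            # sum_k M_ik for each token i
predicted = row_sums[:, None] * mu[None, :]   # mu_j * sum_k M_ik
print(np.allclose(Y_mean, predicted))

# Dividing each row by its sum makes the output mean match the input mean.
M_norm = M / row_sums[:, None]
Y_norm_mean = M_norm @ X_mean
print(np.allclose(Y_norm_mean, X_mean))
```

Because the row sums of $M$ depend on how many tokens each row covers, the unnormalized output mean shifts with the token count. Normalizing the rows removes that dependence, which is one way to read the abstract's emphasis on attention normalization for cross-resolution generation.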

Figures (15)

  • Figure 1: A $16384\times8192$-resolution example in the theme of Black Myth: Wukong generated by LinFusion on a single GPU with Canny-conditioned ControlNet. The textual prompt is "the back view of the Monkey King holding a rod in hand stands, 16k, high quality, best quality, style of a 3A game, fantastic style". The original picture and the extracted Canny edge are shown in Fig. \ref{fig:5}.
  • Figure 2: (a) and (b): Comparisons of the proposed LinFusion with original SD-v1.5 under various resolutions in terms of generation speed using 8 steps and GPU memory consumption. The dashed lines denote estimated values using quadratic functions due to out-of-memory error. (c) and (d): Efficiency comparisons on various architectures under their default resolutions.
  • Figure 3: Overview of LinFusion. We replace self-attention layers in the original SD with our LinFusion modules and adopt knowledge distillation to optimize the parameters.
  • Figure 4: (a) The architecture of Mamba2. Bi-directional SSM is additionally involved here. (b) Mamba2 without gating and RMS-Norm. (c) Normalization-aware Mamba2. (d) The proposed LinFusion module with generalized linear attention.
  • Figure 5: Qualitative text-to-image results by LinFusion based on various architectures.
  • ...and 10 more figures
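
A back-of-the-envelope calculation shows why the $16384\times8192$ example in Figure 1 is out of reach for attention that materializes its weight matrix. The sketch assumes the 8x spatial downsampling of the Stable Diffusion VAE, self-attention applied on the full latent grid, 16-bit values, and a hypothetical head dimension of 64. The function name and its parameters are illustrative and not taken from the LinFusion code.

```python
def attention_cost(width, height, vae_factor=8, bytes_per_value=2, head_dim=64):
    """Rough memory estimate for one attention head (illustrative assumptions only)."""
    # Number of spatial tokens on the latent grid.
    n = (width // vae_factor) * (height // vae_factor)
    # Quadratic case: one full (n, n) attention matrix.
    full_matrix_bytes = n * n * bytes_per_value
    # Linear case: a (head_dim, head_dim) key-value summary plus a head_dim normalizer.
    linear_state_bytes = (head_dim * head_dim + head_dim) * bytes_per_value
    return n, full_matrix_bytes, linear_state_bytes

n, quad_bytes, lin_bytes = attention_cost(16384, 8192)
print(f"tokens: {n}")
print(f"full attention matrix: {quad_bytes / 2**40} TiB")
print(f"linear-attention state: {lin_bytes} bytes")
```

Under these assumptions the latent grid has about 2.1 million tokens, and a single explicit attention matrix would take 8 TiB, while the linear-attention summary stays a few kilobytes regardless of resolution. Memory-efficient attention kernels avoid storing the full matrix, so this figure illustrates the scaling rather than the memory use of any specific implementation.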

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 1
  • Proposition 2
  • proof
  • Proposition 3
  • proof