RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Zhanjie Zhang, Xuanhua He, Shanyuan Liu, Bo Cheng, Dawei Leng, Yuhui Yin, Jie Zhang

TL;DR

RelaCtrl addresses inefficiency in controllable diffusion transformers by introducing a relevance-guided strategy that allocates control blocks to layers with high ControlNet Relevance Score while replacing heavy copy blocks with the lightweight Two-Dimensional Shuffle Mixer. The approach couples a Relevance-Guided Lightweight Control Block with TDSM to reduce parameters and FLOPs without compromising control fidelity, achieving comparable or superior results to state-of-the-art methods. Experimental results across multiple conditional tasks and models show improved control accuracy and image quality with modest resource overhead, and the method generalizes to Flux. This yields a practical, scalable solution for efficient controllable generation in diffusion transformers.

Abstract

The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation because they do not account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score", i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-$\delta$.

Paper Structure

This paper contains 24 sections, 2 theorems, 14 equations, 9 figures, 5 tables.

Key Result

Theorem 2.4

The lower bound of $d(t_j)$ is:

Figures (9)

  • Figure 1: Effect of skipping a specific position within the ControlNet block on the quality of the generated image. Higher FID and HDD indicate a more significant impact of the skipped layer on the quality of the final results, reflecting a stronger correlation with the generated image quality.
  • Figure 2: The relevance diagram of different layers in the DiT-ControlNet was calculated based on the FID and HDD ranks. The overall trend shows an initial increase followed by a decrease. The selected placement positions of RelaCtrl in PixArt-$\alpha$ are marked with white numbers.
  • Figure 3: The overall architecture of RelaCtrl. Control block locations are prioritized based on the ControlNet Relevance Score, ranked from highest to lowest. The direct duplication of the main branch in the original ControlNet is replaced with the carefully designed Relevance-Guided Lightweight Control Block. Additionally, the Two-Dimensional Shuffle Mixer effectively reduces model parameters and computational overhead while preserving performance.
  • Figure 4: Qualitative comparison of different methods. Please zoom in for better details.
  • Figure 5: Generation effects of RelaCtrl under varying control conditions.
  • ...and 4 more figures
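
The captions of Figures 1 and 2 describe a rank-based relevance measure: each control layer is skipped in turn, the resulting FID and HDD are recorded, and layers are ranked by how much skipping them degrades the output. The sketch below is one possible reading of that procedure, not the authors' implementation. The metric values are made up for illustration, and the choice to combine the two ranks by simple summation (with ties broken by layer index) is an assumption; the function names are hypothetical.

```python
from typing import Dict, List


def rank_descending(values: Dict[int, float]) -> Dict[int, int]:
    """Map each layer to its rank; rank 1 is the largest metric value,
    i.e. the layer whose removal hurts the most."""
    ordered = sorted(values, key=lambda layer: values[layer], reverse=True)
    return {layer: rank for rank, layer in enumerate(ordered, start=1)}


def relevance_order(fid: Dict[int, float], hdd: Dict[int, float]) -> List[int]:
    """Order layers from most to least relevant.

    Assumption: the combined score is the sum of the FID rank and the
    HDD rank; ties are broken by the smaller layer index.
    """
    fid_rank = rank_descending(fid)
    hdd_rank = rank_descending(hdd)
    combined = {layer: fid_rank[layer] + hdd_rank[layer] for layer in fid}
    return sorted(combined, key=lambda layer: (combined[layer], layer))


def select_positions(fid: Dict[int, float], hdd: Dict[int, float], k: int) -> List[int]:
    """Pick the k most relevant layer indices, returned in layer order."""
    return sorted(relevance_order(fid, hdd)[:k])


# Hypothetical metrics measured when skipping control layer i (not paper data).
FID_WHEN_SKIPPED = {0: 20.0, 1: 25.0, 2: 31.0, 3: 28.0, 4: 22.0}
HDD_WHEN_SKIPPED = {0: 95.0, 1: 101.0, 2: 104.0, 3: 108.0, 4: 97.0}

if __name__ == "__main__":
    print(relevance_order(FID_WHEN_SKIPPED, HDD_WHEN_SKIPPED))      # [2, 3, 1, 4, 0]
    print(select_positions(FID_WHEN_SKIPPED, HDD_WHEN_SKIPPED, 3))  # [1, 2, 3]
```

With these toy numbers the middle layers rank highest, which mirrors the rise-then-fall trend that the Figure 2 caption reports; the actual positions chosen for PixArt-$\alpha$ are those marked in that figure.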

Theorems & Definitions (5)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Theorem 2.4
  • Corollary 2.5