OminiControl2: Efficient Conditioning for Diffusion Transformers

Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang

TL;DR

This work tackles the computational bottleneck of conditioning diffusion transformers on multi-modal control signals. It introduces OminiControl2, which combines two techniques: a compact token representation that sharply reduces the number of condition tokens, and conditional feature reuse, which computes condition features once and reuses them across denoising steps with the help of an asymmetric attention mask. Together these yield up to a 5.9$\times$ speedup and cut conditional overhead by over 90% while maintaining generation quality across multiple conditioning tasks and multi-condition scenarios. These advances make complex, multi-modal controllable image synthesis with diffusion transformers practical on standard hardware.

Abstract

Fine-grained control of text-to-image diffusion transformer (DiT) models remains a critical challenge for practical deployment. While recent advances such as OminiControl have enabled controllable generation from diverse control signals, these methods become computationally inefficient when handling long conditional inputs. We present OminiControl2, a framework for efficient image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9$\times$ speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
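The feature reuse idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single attention layer with one head, ignores timestep modulation and text tokens, and uses made-up names (`asymmetric_mask`, `full_step`, `reuse_step`) and random weights. It only shows why blocking condition tokens from attending to noisy image tokens makes their keys and values independent of the denoising step, so they can be cached after the first step.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def asymmetric_mask(n_img, n_cond):
    """Boolean mask (True = may attend) over the sequence [image; condition]."""
    n = n_img + n_cond
    allowed = np.ones((n, n), dtype=bool)
    # Condition queries must not see noisy image keys (assumed layout).
    allowed[n_img:, :n_img] = False
    return allowed

def attention(q, k, v, allowed=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if allowed is not None:
        scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_img, n_cond, dim = 6, 4, 8
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
cond = rng.normal(size=(n_cond, dim))      # condition tokens, fixed across steps
mask = asymmetric_mask(n_img, n_cond)

def full_step(x):
    """Baseline: recompute everything for image and condition tokens."""
    tokens = np.concatenate([x, cond])
    return attention(tokens @ w_q, tokens @ w_k, tokens @ w_v, mask)

# First step: full computation, then cache the condition keys and values.
x0 = rng.normal(size=(n_img, dim))
out0 = full_step(x0)
k_cond, v_cond = cond @ w_k, cond @ w_v

def reuse_step(x):
    """Later steps: only image tokens are projected; cached K/V are appended."""
    k = np.concatenate([x @ w_k, k_cond])
    v = np.concatenate([x @ w_v, v_cond])
    return attention(x @ w_q, k, v)

# A different noisy input stands in for a later denoising step.
x1 = rng.normal(size=(n_img, dim))
out1 = full_step(x1)
cond_unchanged = np.allclose(out0[n_img:], out1[n_img:])
reuse_matches = np.allclose(out1[:n_img], reuse_step(x1))
```

In this toy setting `cond_unchanged` and `reuse_matches` both hold exactly, because the masked condition rows never depend on the image tokens. In a full DiT, other step-dependent inputs may make reuse an approximation rather than an identity.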

Paper Structure

This paper contains 32 sections, 9 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Illustration of the compression and position correcting of compact token representation for condition images.
  • Figure 2: Illustration of token integration processing for inpainting. By combining noisy and condition tokens based on the mask, we reduce token count from 2N to N.
  • Figure 3: Feature similarity across denoising steps. While noisy image tokens $X$ change significantly between steps (left), condition tokens $C_{I}$ maintain high similarity throughout the denoising process (right).
  • Figure 4: Visual comparison of the original inference pipeline (left) and naive feature reuse strategy (right).
  • Figure 5: Illustration of our feature reuse strategy. (a) Overview of the denoising pipeline, with full computation performed only at the first step. (b) Detailed view of the feature reuse mechanism in Attention, where condition token features ($K$, $V$) computed in the first step are reused in subsequent steps. (c) Asymmetric attention mask that prevents condition tokens from attending to noisy image tokens, enabling consistent feature reuse.
  • ...and 4 more figures
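
The token integration described in the Figure 2 caption can be sketched as a mask-based selection between two token sequences. This is a rough illustration only: the function name `integrate_tokens`, the per-token boolean mask, and the convention that `True` marks regions to be generated are assumptions, and the paper's actual procedure may differ.

```python
import numpy as np

def integrate_tokens(noisy, cond, inpaint_mask):
    """Merge two (N, d) token sequences into a single (N, d) sequence.

    inpaint_mask is a boolean (N,) array. True (assumed convention) marks
    tokens to be generated, so the noisy token is kept there; elsewhere the
    known condition token is used.
    """
    return np.where(inpaint_mask[:, None], noisy, cond)

n_tokens, dim = 6, 4
noisy = np.full((n_tokens, dim), -1.0)                       # stand-in noisy tokens
cond = np.arange(n_tokens * dim, dtype=float).reshape(n_tokens, dim)
inpaint_mask = np.array([False, True, True, False, False, True])

merged = integrate_tokens(noisy, cond, inpaint_mask)          # N tokens
separate = np.concatenate([noisy, cond])                      # 2N tokens, for comparison
```

Here `merged` has N rows while `separate` has 2N, which is the reduction the caption refers to.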