OminiControl2: Efficient Conditioning for Diffusion Transformers
Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang
TL;DR
This work tackles the computational bottleneck of conditioning diffusion transformers on multi-modal control signals. It introduces OminiControl2, which combines two techniques: a compact token representation that sharply reduces the number of conditioning tokens, and conditional feature reuse, which computes condition embeddings once and reuses them across denoising steps, aided by asymmetrical attention masking. The proposed methods yield up to 5.9x speedups and over 90% reduction in conditional overhead while maintaining generation quality across multiple conditioning tasks and multi-condition scenarios. These advances make complex, multi-modal controllable image synthesis with diffusion transformers practical on standard hardware.
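To illustrate why cutting conditioning tokens matters, the following minimal sketch counts query-key pairs in joint self-attention, whose cost grows quadratically with total sequence length. The function name `attention_pairs` and the token counts (1024 image tokens, 1024 versus 64 condition tokens) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def attention_pairs(n_image_tokens: int, n_condition_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over the joint sequence."""
    n_total = n_image_tokens + n_condition_tokens
    return n_total * n_total


# Hypothetical setting: the condition image yields as many tokens as the output image.
baseline_pairs = attention_pairs(n_image_tokens=1024, n_condition_tokens=1024)

# Hypothetical compact representation: 16x fewer condition tokens.
compact_pairs = attention_pairs(n_image_tokens=1024, n_condition_tokens=64)

# Fraction of the baseline attention cost that remains after compression.
relative_cost = compact_pairs / baseline_pairs
print(f"{relative_cost:.2f}")
```

Because the cost depends on the square of the total token count, removing condition tokens saves more than a proportional share of attention work in this toy setting.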
Abstract
Fine-grained control of text-to-image diffusion transformer (DiT) models remains a critical challenge for practical deployment. While recent advances such as OminiControl have enabled controllable generation from diverse control signals, these methods become computationally inefficient when handling long conditional inputs. We present OminiControl2, a framework for efficient image-conditioned image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9$\times$ speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
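The feature reuse idea can be sketched with a toy example. Everything below is an assumption made for illustration rather than the authors' implementation: the token order (image tokens first, then condition tokens), the mask direction (condition queries are blocked from attending to image keys), and a single attention head with identity projections. Under these assumptions, the condition-token outputs do not depend on the noisy image tokens, which is what would allow them to be computed once and cached across denoising steps.

```python
import math


def build_asymmetric_mask(n_img, n_cond):
    """mask[q][k] is True when query token q may attend to key token k.

    Assumed layout: image tokens first, then condition tokens.
    Image queries see everything; condition queries see only condition keys.
    """
    n = n_img + n_cond
    mask = [[True] * n for _ in range(n)]
    for q in range(n_img, n):      # condition queries
        for k in range(n_img):     # image keys
            mask[q][k] = False
    return mask


def masked_attention(tokens, mask):
    """Toy single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q, q_vec in enumerate(tokens):
        allowed = [k for k in range(len(tokens)) if mask[q][k]]
        scores = [scale * sum(a * b for a, b in zip(q_vec, tokens[k]))
                  for k in allowed]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * tokens[k][d] for w, k in zip(weights, allowed)) / total
            for d in range(dim)
        ])
    return outputs


# Condition tokens stay fixed; image tokens change between denoising steps.
cond = [[0.5, -1.0], [1.5, 0.2]]
img_early = [[0.1, 0.9], [2.0, -0.3], [0.0, 0.4]]
img_late = [[-1.2, 0.7], [0.3, 0.3], [1.1, -2.0]]

mask = build_asymmetric_mask(n_img=3, n_cond=2)
out_early = masked_attention(img_early + cond, mask)
out_late = masked_attention(img_late + cond, mask)

# Compare the condition-token outputs produced at the two steps.
cond_early, cond_late = out_early[3:], out_late[3:]
max_diff = max(abs(a - b)
               for row_a, row_b in zip(cond_early, cond_late)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

In this sketch the image-token outputs differ between the two steps while the condition-token outputs match, so a cache of condition features computed at the first step could stand in for recomputation at later steps. A real DiT block also applies learned projections and timestep-dependent modulation, which this toy omits.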
