FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation
Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang
TL;DR
FlexEControl addresses two challenges in diffusion-based text-to-image generation: training efficiency and faithful multimodal conditioning. It introduces a weight-decomposition framework that shares decomposed cross-attention weights across input conditions, reducing trainable parameters and memory use while preserving representational capacity. The authors add dataset augmentation, cross-attention supervision, and masked diffusion losses to enable robust handling of multiple input modalities and their combinations. Empirical results show about 30% lower memory usage and about 41% fewer trainable parameters than Uni-ControlNet, with competitive or better image quality and alignment, validated by quantitative metrics and human evaluation. These savings make the method practical for scalable, controllable T2I systems.
Abstract
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
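To give a rough sense of why sharing decomposed weights across input conditions can shrink the trainable parameter count, the sketch below compares one independent projection matrix per condition against a factorization in which every condition reuses a single shared factor. The exact form of the decomposition is not specified in the text above, so the low-rank form `W_k = A_k @ B`, the class name `SharedFactorProjections`, and all dimensions are assumptions made purely for illustration. The sketch is not the authors' implementation and is not meant to reproduce the reported 41% figure.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Small random initialisation, standing in for trainable weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def independent_param_count(num_conditions, d_out, d_in):
    """One full d_out x d_in matrix per input condition."""
    return num_conditions * d_out * d_in


def shared_param_count(num_conditions, d_out, d_in, rank):
    """One d_out x rank factor per condition plus one shared rank x d_in factor."""
    return num_conditions * d_out * rank + rank * d_in


class SharedFactorProjections:
    """Hypothetical illustration: W_k = A_k @ B, with B shared by all conditions."""

    def __init__(self, num_conditions, d_out, d_in, rank, seed=0):
        rng = random.Random(seed)
        # B is stored once and reused by every condition.
        self.shared = random_matrix(rank, d_in, rng)
        # A_k is the only condition-specific part.
        self.per_condition = [
            random_matrix(d_out, rank, rng) for _ in range(num_conditions)
        ]

    def weight(self, k):
        """Materialise the full d_out x d_in projection for condition k."""
        return matmul(self.per_condition[k], self.shared)

    def num_parameters(self):
        shared = len(self.shared) * len(self.shared[0])
        specific = sum(len(a) * len(a[0]) for a in self.per_condition)
        return shared + specific


if __name__ == "__main__":
    model = SharedFactorProjections(num_conditions=4, d_out=8, d_in=8, rank=2)
    print("independent:", independent_param_count(4, 8, 8))
    print("shared-factor:", model.num_parameters())
```

In this toy setting the shared factor is paid for once, so the cost of adding a further condition is only the small condition-specific factor. How much is actually saved depends on the chosen rank and on which layers are decomposed, neither of which is given here.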
