FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

Xuehai He; Jian Zheng; Jacob Zhiyuan Fang; Robinson Piramuthu; Mohit Bansal; Vicente Ordonez; Gunnar A Sigurdsson; Nanyun Peng; Xin Eric Wang

TL;DR

FlexEControl addresses two challenges in diffusion-based text-to-image generation: training efficiency and faithful multimodal conditioning. It introduces a weight-decomposition framework that shares decomposed cross-attention weights across input conditions, reducing trainable parameters and memory while preserving representational capacity. The authors add dataset augmentation, cross-attention supervision, and a masked diffusion loss so the model can handle multiple input modalities and their combinations. Compared with Uni-ControlNet, the method uses about 30% less memory and 41% fewer trainable parameters, with competitive or better image quality and alignment, as measured by quantitative metrics and human evaluation.

Abstract

Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
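To make the parameter saving concrete, here is a minimal sketch of one way a shared decomposed weight could be assembled. This excerpt does not give the exact decomposition, so the Kronecker-product form, the 2x2 shared factor, and the length-3 vector pairs are assumptions, chosen only because they reproduce the $4+6n$ count quoted in the Figure 2 caption. All function names are hypothetical.

```python
# Sketch only: assumes W = shared (x) sum_i u_i v_i^T, which is NOT confirmed
# by this excerpt. Shapes (2x2 shared factor, length-3 vectors) are chosen to
# match the 4 + 6n parameter count in the Figure 2 caption.

def outer_sum(us, vs):
    """Sum of rank-one outer products u_i v_i^T (a low-rank matrix)."""
    rows, cols = len(us[0]), len(vs[0])
    return [[sum(u[r] * v[c] for u, v in zip(us, vs)) for c in range(cols)]
            for r in range(rows)]

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_val * b_val for a_val in a_row for b_val in b_row]
            for a_row in a for b_row in b]

def build_weight(shared, us, vs):
    """Assemble the full weight from the shared factor and n vector pairs."""
    return kron(shared, outer_sum(us, vs))

def trainable_params(shared, us, vs):
    """Count the scalars that would be trained under this decomposition."""
    return (sum(len(row) for row in shared)
            + sum(len(u) for u in us)
            + sum(len(v) for v in vs))

# Hypothetical example with n = 2 decomposed pairs.
shared = [[1.0, 2.0], [3.0, 4.0]]          # reused by every input condition
us = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # condition-specific vectors
vs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

weight = build_weight(shared, us, vs)      # 6x6 matrix
n_params = trainable_params(shared, us, vs)
full_params = len(weight) * len(weight[0])
```

Under these assumptions the 6x6 weight needs 4 + 6n scalars instead of 36, and the shared factor is stored once no matter how many conditions reuse it.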

Paper Structure (20 sections, 6 equations, 6 figures, 5 tables)


Introduction
Method
Preliminary
Efficient Training for Controllable Text-to-Image (T2I) Generation
Enhanced Training for Conditional Inputs
Dataset Augmentation with Text Parsing and Segmentation
Cross-Attention Supervision
Masked Noise Prediction
Experiments
Datasets
Evaluation Metrics
Experimental Setup
Structural Input Condition Extraction
Baselines
Quantitative Results
...and 5 more sections

Figures (6)

Figure 1: (a) FlexEControl excels in training efficiency, achieving superior performance with just half the training data compared to its counterparts on (b) controllable text-to-image generation with different input conditions (one edge map and one segmentation map). (c) FlexEControl effectively conditions on two Canny edge maps. The text prompt is "Stormtrooper's lecture at the football field" in both (b) and (c).
Figure 2: Overview of FlexEControl: a decomposed green matrix is shared across different input conditions, significantly enhancing the model's efficiency. During training, we integrate two specialized loss functions to enable flexible control and to adeptly manage conflicting conditions. In the example depicted here, the new parameter size is efficiently condensed to $4+6n$, where $n$ denotes the number of decomposed matrix pairs.
Figure 3: Visualization of the decomposed shared “slow” weights (right image) for the single-condition case, where the input condition (left image) is a depth map and the input text prompt is "Car". The decomposed shared weights of the last cross-attention block in Stable Diffusion are averaged across all attention heads.
Figure 4: Qualitative comparison of FlexEControl and existing controllable diffusion models with multiple heterogeneous conditions. First row: FlexEControl effectively integrates both the segmentation and edge maps to generate a coherent image while Uni-ControlNet and LoRA miss the segmentation map and Uni-Control generates a messy image. Second row: The input condition types are one depth map and one sketch map. FlexEControl can do more faithful generation while all three others generate the candle in the coffee.
Figure 5: Qualitative comparison of FlexEControl and existing controllable diffusion models with a single condition. Text prompt: "A bed". The image quality of FlexEControl is comparable to that of existing methods and of Uni-ControlNet + LoRA, while FlexEControl is much more efficient.
...and 1 more figures
