UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation

Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Mingming Gong, Gui-Song Xia

TL;DR

This work addresses the difficulty of precisely controlling pixel-level layout, object appearance, and global style in text-to-image generation by introducing UNIC-Adapter, a unified image-instruction adapter built on the Multi-Modal-Diffusion Transformer (MM-DiT). By fusing task instructions and diverse conditional images through cross-attention augmented with Rotary Position Embedding (RoPE), it supports controllable generation across 14 input types within a single SD3-based model. The approach is evaluated on pixel-level spatial control, subject-driven generation, and style-image-based synthesis, with ablations supporting the importance of cross-modal interactions, RoPE, and a dedicated $L_{\text{cross}}^{q}$ layer. Because one model replaces multiple specialized ones, the framework reduces training complexity while retaining strong controllability and fidelity, which suggests it could serve flexible single-model T2I systems and may be integrated with other diffusion backbones.

Abstract

Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.

Paper Structure

This paper contains 21 sections, 9 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: With the UNIC-Adapter, SD3 enables flexible and controllable generation across multiple reference modalities within a single model. (a) to (d) represent text prompts, task instructions, conditional images, and generated images, respectively.
  • Figure 2: The overall architecture of our proposed UNIC-Adapter. The task instruction and conditional image features progressively attend to each other through a series of $N$ adapter blocks. In each adapter block, the image features $Z_{\text{img}}$ from the MM-DiT block in the main image generation branch serve as the query, while both the task instruction features $Z_{\text{ist}}$ and the conditional image features $Z_{\text{con}}$ serve as keys and values. For simplicity, normalization layers and feed-forward networks are omitted in this figure.
  • Figure 3: Visualization results of our UNIC-Adapter on twelve pixel-level control tasks from the MultiGen-20M dataset. The first and third rows show different types of conditional images, while the second and fourth rows display the corresponding generated images.
  • Figure 4: Visualization results of our UNIC-Adapter on DreamBench for subject-driven generation. The first column displays the subject images, while the other three columns show the generated images based on different prompts.
  • Figure 5: Visualization results of our UNIC-Adapter on style-image-based T2I generation. The first row shows the reference style image, and each subsequent row contains images generated from the same prompt, influenced by different style images.
  • ...and 6 more figures
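
The cross-attention described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head, identity projections in place of learned query/key/value weights, a 1D RoPE rather than whatever positional layout the paper uses, and RoPE applied only to the image queries and conditional-image keys (the instruction tokens are left unrotated). The function names `rope`, `softmax`, and `cross_attention` are hypothetical; only the roles of $Z_{\text{img}}$, $Z_{\text{ist}}$, and $Z_{\text{con}}$ come from the caption.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply 1D rotary position embedding to an even-length vector.

    Each consecutive pair (vec[2i], vec[2i+1]) is rotated by the angle
    pos * base**(-i / (d/2)), so dot products between rotated vectors
    depend only on the difference of their positions.
    """
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(z_img, z_ist, z_con, img_pos, con_pos):
    """Image tokens (queries) attend to instruction + condition tokens.

    z_img: image features from the generation branch (queries).
    z_ist: task instruction features (keys/values, no RoPE: assumption).
    z_con: conditional image features (keys/values, RoPE on keys).
    Learned projections are omitted (identity), which is an assumption
    made purely to keep the sketch short.
    """
    d = len(z_img[0])
    queries = [rope(q, p) for q, p in zip(z_img, img_pos)]
    keys = list(z_ist) + [rope(k, p) for k, p in zip(z_con, con_pos)]
    values = list(z_ist) + list(z_con)

    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output token is a convex combination of the value tokens.
        out.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(d)])
    return out


if __name__ == "__main__":
    z_img = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.4]]
    z_ist = [[0.1, 0.2, 0.3, 0.4]]
    z_con = [[1.0, 1.0, 0.0, 0.0], [0.0, -1.0, 0.5, 0.5]]
    fused = cross_attention(z_img, z_ist, z_con, img_pos=[0, 1], con_pos=[0, 1])
    print(len(fused), len(fused[0]))
```

Giving the image queries and the conditional-image keys matching position indices means the rotation cancels for spatially aligned tokens, which is one plausible reading of why RoPE would help pixel-level control tasks; the caption itself does not state this.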