TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models

Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu, Tzu-Ling Lin, Hong-Han Shuai

TL;DR

This work introduces TF-TI2I, a training-free Text-and-Image-to-Image pipeline that uses the implicit-context learning of MM-DiT to fuse multiple image references into text-to-image generation without fine-tuning. It proposes three components for managing multi-reference conditioning: Contextual Tokens Sharing (CTS), Reference Contextual Masking (RCM), and Winner-Takes-All (WTA). It also introduces FG-TI2I Bench for fine-grained TI2I evaluation. The approach achieves state-of-the-art results on 12 of 18 FG-TI2I metrics and competitive results on related benchmarks such as DreamBench and Wild-TI2I, while requiring no additional training. The main limitation is residual interference between references, attributed to the limited precision of RCM; this points to future work on semantic discrimination and control in multi-reference TI2I.

Abstract

Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
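To make the interplay of the three ideas in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that sharing means concatenating reference contextual tokens to the keys and values of an attention step, that Reference Contextual Masking can be modelled as a boolean keep-mask per reference token, and that Winner-Takes-All picks, for each vision-token query, the single reference whose best surviving token has the highest attention score. The function name `wta_attention`, its arguments, and the winner criterion are all hypothetical choices for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def wta_attention(query, own_keys, own_values, ref_keys, ref_values, ref_masks):
    """Attend from one vision-token query to its own tokens plus ONE reference.

    Hypothetical layout (an assumption, not taken from the paper):
      ref_keys[r], ref_values[r]: contextual tokens of reference r
      ref_masks[r][j]: True if token j of reference r is instruction-relevant
    own_keys must be non-empty.
    """
    scale = 1.0 / math.sqrt(len(query))

    # Masking step (RCM-style): keep only instruction-relevant tokens.
    kept = [
        [(k, v) for k, v, keep in zip(keys, values, mask) if keep]
        for keys, values, mask in zip(ref_keys, ref_values, ref_masks)
    ]

    # Winner selection (assumed criterion): highest single-token score.
    best = [
        max((dot(query, k) * scale for k, _ in pairs), default=-math.inf)
        for pairs in kept
    ]
    winner = max(range(len(kept)), key=lambda r: best[r])

    # Sharing step: concatenate the winner's kept tokens to the query's own.
    keys = list(own_keys) + [k for k, _ in kept[winner]]
    values = list(own_values) + [v for _, v in kept[winner]]

    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return winner, out
```

Calling `wta_attention` once per vision token gives each token its own winning reference, so different image regions can draw on different references. Restricting each query to one reference keeps the number of attended tokens close to the single-reference case, which is one plausible reading of how such a module could limit distribution shift.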

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of the TF-TI2I pipeline. TF-TI2I leverages contextual visual information learned by textual tokens. By sharing the contextual tokens, i.e., concatenating $\tau_{\mathcal{P}}^{l}$ to the upper block (Contextual Tokens Sharing, CTS), the output follows the prompt while staying aligned with the references. Additionally, we incorporate Reference Contextual Masking (RCM) to mitigate mutual interference between references and employ the Winner-Takes-All (WTA) module to minimize distribution shifts in multi-reference scenarios.
  • Figure 2: t-SNE visualization of $\{\tau^{l}_{P_1} \mid P_1 \in \mathbb{P}\}$ (green) and $\{\tau^{l}_{P_2} \mid P_2 \in \mathbb{P}\}$ (orange) at different timesteps and layer indices. Clusters form in deeper layers and at later timesteps.
  • Figure 3: Illustration of replacing contextual tokens between different images. As shown in panel (c), the visual information of panel (b) is successfully transferred to panel (a).
  • Figure 4: Illustration of sub-tasks in FG-TI2I, where Object, Texture, Action, and Background are abbreviated as O, T, A, and B, respectively. Text-only inputs are shown in red and image-supported inputs in blue. Sub-tasks are grouped by the number of image references.
  • Figure 5: Qualitative comparison of quad-reference sub-tasks (left) and single-reference sub-tasks (right) in FG-TI2I. The inputs Object, Texture, Action, and Background are denoted O, T, A, and B. Red marks text-only input and blue marks reference-supported input.
  • ...and 2 more figures