TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models

Teng-Fang Hsiao; Bo-Kai Ruan; Yi-Lun Wu; Tzu-Ling Lin; Hong-Han Shuai

TL;DR

This work introduces TF-TI2I, a training-free Text-and-Image-to-Image (TI2I) pipeline that uses the implicit-context learning of MM-DiT to fuse multiple reference images into text-to-image generation without fine-tuning. It proposes three components for managing multi-reference conditioning: Contextual Tokens Sharing (CTS), Reference Contextual Masking (RCM), and Winner-Takes-All (WTA). It also introduces FG-TI2I Bench for fine-grained TI2I evaluation. The approach achieves state-of-the-art results on 12 of 18 FG-TI2I metrics and competitive results on related benchmarks such as DreamBench and Wild-TI2I, while requiring no additional training. Its main limitation is residual inter-reference interference caused by the limited precision of RCM, which points to future work on improving semantic discrimination and control in multi-reference TI2I.

Abstract

Text-and-Image-to-Image (TI2I), an extension of Text-to-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often use image inputs only partially, focusing on specific elements such as objects or styles, or they suffer a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which, as we point out, textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images and sharing information selectively through Reference Contextual Masking, a technique that confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
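
The abstract describes Winner-Takes-All only at a high level: each vision token prioritizes its most pertinent reference. The toy sketch below illustrates one possible reading of that idea and is not the authors' implementation. It assumes that "most pertinent" means the reference whose contextual-token keys give the highest scaled dot-product score for a given vision-token query, and that the token then attends only to that reference. The function name `winner_takes_all_attention`, the data layout, and the scoring rule are all hypothetical, and the sketch omits the attention to text and to the generated image's own tokens that a real MM-DiT block would also perform.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def winner_takes_all_attention(queries, references):
    """Toy per-token reference selection (assumed rule, not from the paper).

    queries: list of query vectors, one per vision token.
    references: list of references; each reference is a list of
        (key, value) vector pairs standing in for its contextual tokens.
    Returns (outputs, winners): the attended vector and the index of the
    selected reference for every query.
    """
    outputs, winners = [], []
    for q in queries:
        scale = 1.0 / math.sqrt(len(q))
        # Scaled dot-product score of this query against every key,
        # grouped by reference.
        per_ref_scores = [
            [dot(q, k) * scale for k, _ in ref] for ref in references
        ]
        # Assumption: the winner is the reference with the single
        # strongest key match for this query.
        winner = max(
            range(len(references)), key=lambda r: max(per_ref_scores[r])
        )
        # Attend only to the winning reference's tokens.
        weights = softmax(per_ref_scores[winner])
        values = [v for _, v in references[winner]]
        dim = len(values[0])
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
        winners.append(winner)
    return outputs, winners


# Two made-up references with two contextual tokens each.
ref_a = [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [1.0, 0.0])]
ref_b = [([0.0, 1.0], [0.0, 1.0]), ([0.1, 0.9], [0.0, 1.0])]
queries = [[2.0, 0.0], [0.0, 2.0]]

outputs, winners = winner_takes_all_attention(queries, [ref_a, ref_b])
print(winners)
```

In this toy setup the first query aligns with the keys of `ref_a` and the second with those of `ref_b`, so each vision token draws its values from a single reference rather than from a mixture of both. That is the intuition behind limiting distribution shift when several references are present.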
