Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

Taihang Hu, Linxuan Li, Joost van de Weijer, Hongcheng Gao, Fahad Shahbaz Khan, Jian Yang, Ming-Ming Cheng, Kai Wang, Yaxing Wang

TL;DR

This paper introduces Token Merging (ToMe), a method that enhances semantic binding by aggregating relevant tokens into a single composite token, so that the object, its attributes, and its sub-objects all share the same cross-attention map.

Abstract

Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes, and its sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects in complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token and improve generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at https://github.com/hutaihang/ToMe.
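
As a rough illustration of the two ideas named in the abstract, the sketch below merges a group of token embeddings into one composite token and measures the entropy of a cross-attention map. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract only says that tokens are "aggregated", so the element-wise sum used here is an assumption, as are the hypothetical helper names `merge_tokens` and `attention_entropy`, the choice to drop the merged-in tokens from the sequence, and the use of Shannon entropy over a normalized map.

```python
import math
from typing import List, Sequence

Vector = List[float]


def merge_tokens(
    embeddings: Sequence[Vector], group: Sequence[int], anchor: int
) -> List[Vector]:
    """Replace the tokens in `group` with one composite token at `anchor`.

    Assumption: the composite is the element-wise sum of the grouped
    embeddings, and the other grouped tokens are dropped from the sequence.
    """
    if anchor not in group:
        raise ValueError("anchor must be one of the grouped indices")
    dim = len(embeddings[0])
    composite = [sum(embeddings[i][d] for i in group) for d in range(dim)]
    merged: List[Vector] = []
    for i, vec in enumerate(embeddings):
        if i == anchor:
            merged.append(composite)   # composite token takes the anchor slot
        elif i not in group:
            merged.append(list(vec))   # unrelated tokens pass through unchanged
    return merged


def attention_entropy(attn_map: Sequence[float]) -> float:
    """Shannon entropy of a flattened, non-negative attention map.

    Assumption: the map is normalized to sum to 1 before the entropy is
    taken. Lower entropy means attention is concentrated on fewer locations.
    """
    total = sum(attn_map)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [a / total for a in attn_map]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy 2-D embeddings for the tokens: "dog", "hat", "cat".
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
merged = merge_tokens(toy, group=[0, 1], anchor=0)  # bind "hat" to "dog"
```

In this toy setup, the composite token stands in for both "dog" and "hat", so a model conditioned on `merged` sees a single embedding, and hence a single cross-attention map, for the pair. Minimizing a quantity like `attention_entropy` would push that map toward a compact region; how the paper actually weights and applies its losses is not specified in the abstract.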

Paper Structure

This paper contains 23 sections, 1 equation, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Current state-of-the-art T2I models often struggle with semantic binding in generated images according to textual prompts. For example, hats and sunglasses are placed on incorrect objects. We introduce a novel method ToMe to address these challenges.
  • Figure 2: (a) We generate images with various input prompts: "a cat wearing sunglasses and a dog wearing a hat"; the single-token embedding [dog]; the end token [EOT]. (b) We then compute the probability that the images generated in (a) contain "sunglasses".
  • Figure 3: (a) Image generations with the property of token additivity. All images are generated by the prompt template "a photo of a {object}." (b) PCA plot for additivity of text embeddings.
  • Figure 4: ToMe is composed of two parts: Token Merging with end token substitution, and an iterative composite token update guided by two auxiliary losses.
  • Figure 5: Qualitative comparison among various T2I generation methods with complex prompts.
  • ...and 10 more figures