A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization

Chieh-Yun Chen, Chiang Tseng, Li-Wu Tsao, Hong-Han Shuai

TL;DR

This paper proposes a simple but effective, training-free text embedding balance optimization method, together with a new automatic evaluation metric that quantifies information loss more accurately than existing methods and achieves 81% concordance with human assessments.

Abstract

This paper analyzes the impact of the causal manner of the text encoder in text-to-image (T2I) diffusion models, which can lead to information bias and loss. Previous works have addressed these issues through the denoising process. However, no research has discussed how the text embedding contributes to T2I models, especially when generating more than one object. In this paper, we share a comprehensive analysis of the text embedding: i) how the text embedding contributes to the generated images and ii) why information gets lost and is biased towards the first-mentioned object. Accordingly, we propose a simple but effective text embedding balance optimization method, which is training-free and improves information balance in Stable Diffusion by 125.42%. Furthermore, we propose a new automatic evaluation metric that quantifies information loss more accurately than existing methods, achieving 81% concordance with human assessments. This metric effectively measures the presence and accuracy of objects, addressing the limitations of current distribution scores such as CLIP's text-image similarities.
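The causal behaviour described in the abstract can be illustrated with a minimal NumPy sketch. It is not the encoder used in the paper: it is a single attention head with identity projections, and the function name `causal_self_attention` and the random stand-in embeddings are assumptions made for illustration. It shows that, under a causal mask, each position attends only to itself and earlier positions, so later positions accumulate information from earlier ones but not the reverse.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask.

    Illustrative only: queries, keys and values are all `x` itself
    (identity projections), which a real text encoder does not do.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # True above the diagonal marks "future" positions that must be hidden.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # hypothetical stand-ins for 5 token embeddings
out, weights = causal_self_attention(tokens)

# Replace the last token and re-run: earlier outputs must not change,
# because no earlier position can attend to the last one.
edited = tokens.copy()
edited[4] = rng.normal(size=8)
out_edited, _ = causal_self_attention(edited)

earlier_unchanged = np.allclose(out[:4], out_edited[:4])
last_changed = not np.allclose(out[4], out_edited[4])
print(earlier_unchanged, last_changed)
```

In this toy setting the first-mentioned tokens feed into every later position, while a later token cannot influence any earlier one, which is the asymmetry the abstract associates with bias towards the first-mentioned object.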

Paper Structure (38 sections, 6 equations, 18 figures, 6 tables)

Figures (18)

  • Figure 1: Visualization of cross-attention maps when objects are mixed or missing.
  • Figure 2: Overview of the text-to-image generative model, including the details of the causal manner in the attention mechanism. Because the embedding is causal, information accumulates from the starting token through the end of the sequence, which biases the result towards earlier tokens. To balance the critical information, we propose text embedding optimization, which purifies the object tokens with equal weights within their corresponding embedding dimensions.
  • Figure 3: Masking the text embedding to identify the contribution of critical tokens, e.g., cat/dog, and special tokens, e.g., <sot>, <eot>, <pad>. The prompts in the first and second rows both contain cat and dog, but in a different order. The analysis shows that special tokens contain general information about the given prompt. However, the cat/dog tokens carry more weight than the special tokens. In the last two columns, where one of the animal token embeddings is masked while the special tokens' embeddings are retained, the generated image is predominantly influenced by the remaining animal's token embedding.
  • Figure 4: Analysis of masking token embeddings. Masking all the given tokens reduces the mixture issue but increases the missing issue, with balanced existence rates for object 1 and object 2. Masking one of the objects does not completely eliminate the masked object's information but significantly reduces its existence rate. The implementation details are in the Supplement.
  • Figure 5: Analysis of the hypothesis. Replacing the token embedding of the later-mentioned object with the corresponding pure embedding can balance the information but leads to a large drop in the coexistence of the two objects.
  • ...and 13 more figures
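
The masking experiments summarized in the captions for Figures 3 and 4 can be sketched as follows. This is a hedged illustration, not the authors' implementation (their details are in the Supplement): it assumes that "masking" means zeroing out the rows of the text embedding at the chosen token positions, it uses a random array in place of a real encoder output, and the helper `mask_token_embeddings`, the 77 x 768 shape, and the token positions for a prompt like "a cat and a dog" are all assumptions.

```python
import numpy as np

def mask_token_embeddings(embeddings, positions):
    """Return a copy of `embeddings` with the given token rows zeroed out.

    Hypothetical helper: zeroing is one possible reading of "masking".
    """
    masked = embeddings.copy()
    masked[list(positions)] = 0.0
    return masked

# Assumed layout: 77 token slots, 768 dimensions per token.
seq_len, dim = 77, 768
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(seq_len, dim))  # stand-in for a text encoder output

# Assumed positions for "<sot> a cat and a dog <eot> <pad> ...".
CAT, DOG = 2, 5
special = [0] + list(range(6, seq_len))  # <sot>, <eot> and <pad> slots
non_special = [i for i in range(seq_len) if i not in special]

# Mask one animal while keeping the special tokens (last two columns of Figure 3).
keep_dog = mask_token_embeddings(text_emb, [CAT])
keep_cat = mask_token_embeddings(text_emb, [DOG])

# Keep only the special tokens to probe the general information they carry.
special_only = mask_token_embeddings(text_emb, non_special)
```

Each masked array would then be passed to the image generator in place of the original text embedding, so that the resulting images show how much each group of tokens contributes.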