Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function

Chenyi Zhuang; Ying Hu; Pan Gao

TL;DR

This work identifies attribute bias and padding-context entanglement in CLIP-based text encodings as key culprits behind improper attribute binding in text-to-image diffusion models. It introduces Magnet, a training-free method that uses positive and negative binding vectors, augmented by a neighbor-guided estimation strategy, to disentangle object-attribute bindings within the textual space. In human-centric evaluations on ABC-6K and CC-500, Magnet improves synthesis quality and attribute alignment with minimal computational overhead, and it enables anti-prior and unconventional concept generation. The approach is modular, compatible with optimization-based controls, and applicable across CLIP-based encoders, offering a practical path to more faithful T2I synthesis. Overall, Magnet advances the understanding of how the text encoder influences diffusion generation and provides a plug-and-play solution for attribute binding.

Abstract

Text-to-image diffusion models, particularly Stable Diffusion, have revolutionized the field of computer vision. However, synthesis quality often deteriorates when the model is asked to generate images that faithfully represent complex prompts involving multiple attributes and objects. While previous studies suggest that blended text embeddings lead to improper attribute binding, few have explored this in depth. In this work, we critically examine the limitations of the CLIP text encoder in understanding attributes and investigate how this affects diffusion models. We discern a phenomenon of attribute bias in the text space and highlight a contextual issue in padding embeddings that entangle different concepts. We propose Magnet, a novel training-free approach to tackle the attribute binding problem. We introduce positive and negative binding vectors to enhance disentanglement, along with a neighbor strategy to increase accuracy. Extensive experiments show that Magnet significantly improves synthesis quality and binding accuracy with negligible computational cost, enabling the generation of unconventional and unnatural concepts.
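
As a rough illustration of the idea described above (shifting an object embedding with a positive and a negative binding vector, each estimated with the help of neighbor objects), the following sketch may help. It is not the paper's formulation: the additive toy encoder `toy_encode`, the difference-of-embeddings estimate in `binding_vector`, the averaging over neighbors, and the strengths `alpha` and `beta` are all assumptions made for illustration, and a real setup would use a CLIP text encoder instead.

```python
import zlib
import numpy as np

def toy_encode(phrase: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a text encoder (NOT CLIP):
    sums a fixed pseudo-random vector for each word in the phrase."""
    vec = np.zeros(dim)
    for word in phrase.lower().split():
        seed = zlib.crc32(word.encode("utf-8"))  # deterministic per-word seed
        vec += np.random.default_rng(seed).standard_normal(dim)
    return vec

def binding_vector(attribute, objects, encode):
    """Assumed estimate: mean of encode('attr obj') - encode('obj') over the given objects."""
    diffs = [encode(f"{attribute} {obj}") - encode(obj) for obj in objects]
    return np.mean(diffs, axis=0)

def bind(obj, attribute, other_attribute, neighbors, encode, alpha=1.0, beta=1.0):
    """Push the object embedding toward its own attribute and away from
    an attribute that belongs to another object in the prompt."""
    objects = [obj, *neighbors]  # the target object plus its (assumed) neighbor objects
    v_pos = binding_vector(attribute, objects, encode)
    v_neg = binding_vector(other_attribute, objects, encode)
    return encode(obj) + alpha * v_pos - beta * v_neg

# Example: for a prompt such as "a blue apple and a red bench",
# adjust the embedding of "apple" toward "blue" and away from "red".
edited = bind("apple", "blue", "red", ["pear", "cherry"], toy_encode)
print(edited.shape)
```

Because the toy encoder is purely additive, the neighbor average changes nothing here; with a contextual encoder the per-object differences would vary, which is where averaging over neighbors could plausibly stabilize the estimate.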

Paper Structure (33 sections, 4 equations, 28 figures, 5 tables)

Introduction
Analysis of the CLIP text encoder and the diffusion model
Magnet: disentangling concepts with the binding vector
Apply the binding vector on the object embedding
Neighbor-guided vector estimation
Overall workflow
Experiments
Datasets
Metrics
Quantitative comparison
Qualitative comparison
Ablation study
Extensions
Related work
Limitations
...and 18 more sections

Figures (28)

Figure 1: Analysis of the CLIP text encoder for understanding attributes. There is a discrepancy between the word and [EOT] embeddings of the attribute bias on different objects.
Figure 2: (a) Fine-grained study through our designed embedding swapping experiment. The context issue in padding embeddings for (b) single-concept scenario, and (c) multi-concept scenario.
Figure 3: Overview of the proposed Magnet. We manipulate the object embedding with the positive and negative binding vectors, which are estimated with the guidance of neighbor objects.
Figure 4: Qualitative comparison using prompts from ABC-6K and CC-500 datasets. For each prompt, we show the image generated by each method under the same seed.
Figure 5: Prompts with unnatural concepts. Baselines generate exchanged colors (row 1) or unwanted artifacts (row 2) while Magnet demonstrates the anti-prior ability with high-quality outputs.
...and 23 more figures
