Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Eric Hanchen Jiang, Yasi Zhang, Zhi Zhang, Yixin Wan, Andrew Lizarraga, Shufan Li, Ying Nian Wu

TL;DR

This work proposes a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization.

Abstract

Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images from textual prompts. Despite these advances, existing models struggle with complex prompts involving multiple objects and attributes, often misaligning modifiers with their corresponding nouns or neglecting certain elements. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding and a lack of robust generalization guarantees. Leveraging the PAC-Bayes framework, we propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment. We demonstrate the effectiveness of our method on standard benchmarks, achieving state-of-the-art results across multiple metrics. By integrating custom priors into the denoising process, our method enhances image quality and addresses long-standing challenges in T2I diffusion models, paving the way for more reliable and interpretable generative models.
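The four properties named in the abstract can be pictured as loss terms over per-token attention maps. The following is a minimal sketch under stated assumptions, not the authors' formulation: the function names, the use of symmetric KL divergence, the uniform prior, and the equal weighting of terms are all hypothetical choices made for illustration.

```python
import numpy as np

EPS = 1e-12  # numerical floor to avoid log(0) and division by zero


def to_distribution(attn_map):
    """Flatten a non-negative attention map and normalize it to sum to 1."""
    flat = np.asarray(attn_map, dtype=float).ravel()
    return flat / (flat.sum() + EPS)


def kl(p, q):
    """KL(p || q) for discrete distributions, with clipping for stability."""
    p = np.clip(p, EPS, None)
    q = np.clip(q, EPS, None)
    return float(np.sum(p * np.log(p / q)))


def sym_kl(p, q):
    """Symmetric KL divergence (an assumed choice of distance)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def attention_losses(attn, noun_idx, modifier_noun_pairs, irrelevant_idx):
    """Toy versions of the four properties; each index list must be non-empty.

    attn: array of shape (num_tokens, H, W) with values in [0, 1].
    """
    dists = [to_distribution(a) for a in attn]
    uniform = np.full_like(dists[0], 1.0 / dists[0].size)  # assumed prior

    noun_pairs = [(i, j) for k, i in enumerate(noun_idx) for j in noun_idx[k + 1:]]
    # Divergence between objects: different nouns should attend to different
    # regions, so a larger distance gives a lower (more negative) loss.
    divergence = -np.mean([sym_kl(dists[i], dists[j]) for i, j in noun_pairs])
    # Modifier-noun alignment: a modifier should attend where its noun does.
    similarity = np.mean([sym_kl(dists[m], dists[n]) for m, n in modifier_noun_pairs])
    # Minimal attention to irrelevant tokens: penalize their raw attention mass.
    outside = np.mean([attn[i].mean() for i in irrelevant_idx])
    # Regularizer: KL from each attention distribution to the assumed prior.
    regularizer = np.mean([kl(d, uniform) for d in dists])

    terms = {
        "divergence": float(divergence),
        "similarity": float(similarity),
        "outside": float(outside),
        "regularizer": float(regularizer),
    }
    terms["total"] = sum(terms.values())  # equal weights, purely illustrative
    return terms


# Toy prompt "the black cat dog" on a 4x4 grid: tokens 0..3.
cat = np.zeros((4, 4)); cat[:2, :] = 1.0   # "cat" attends to the top half
dog = np.zeros((4, 4)); dog[2:, :] = 1.0   # "dog" attends to the bottom half
black = cat.copy()                         # "black" follows "cat"
the = np.full((4, 4), 0.01)                # weak, diffuse attention
attn = np.stack([the, black, cat, dog])

good = attention_losses(attn, [2, 3], [(1, 2)], [0])  # "black" bound to "cat"
bad = attention_losses(attn, [2, 3], [(1, 3)], [0])   # "black" bound to "dog"
print(good["similarity"], bad["similarity"])
```

In this toy setup, binding the modifier to the noun whose attention map it matches gives a lower similarity term than binding it to the other noun, which is the behavior the alignment property is meant to encourage.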

Paper Structure

This paper contains 23 sections, 22 equations, 11 figures, 2 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Diffusion models often struggle to accurately represent multiple objects in the input text. We identify the root causes of these challenges and introduce a training-free solution based on PAC-Bayesian theory. Here, we show qualitative results from our model compared with the original Stable Diffusion model. For instance, when prompted to depict "a mouse wearing a white spacesuit", Stable Diffusion fails to associate each modifier with its corresponding component and therefore neglects some of the descriptors (i.e., it generates only a mouse or only a spacesuit).
  • Figure 2: An overview of our workflow for optimizing the Stable Diffusion model. It includes aggregation of attention maps, computation of the object-centric attention loss, and updates to $z_t$.
  • Figure 3: Qualitative comparison on the AnE dataset (left two columns) and the DVMP dataset (right two columns). We compare our model with SD (Rombach21), SG (Rassin23), AnE (Chefer23), and EMAMA (Zhang24) using the same prompts; each column shares the same random seed.
  • Figure 4: Qualitative comparison on the ABC-6K dataset. We compare our model with SD (Rombach21), SG (Rassin23), AnE (Chefer23), and EMAMA (Zhang24) using the same prompts; each column shares the same random seed.
  • Figure 5: Qualitative Results of the Ablation Study. Each column illustrates the generated image for the prompt 'a baby monkey and a wooden curved crown and an orange guitar,' showing the effects of omitting specific loss components. From left to right: the full model output, the output without the PAC Regularizer, the output without the Outside Loss, and the output without the Similarity Loss. Each ablation demonstrates the impact on attribute-object alignment and overall image coherence, with attribute misbindings and inconsistencies appearing as each component is removed.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Definition 1: PAC-Bayes Bound
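
The listing above gives only the name of Definition 1. For orientation, one standard form of a PAC-Bayes bound (the McAllester/Maurer variant) is sketched below; the paper's own statement may use a different variant or notation. Here $P$ is a prior fixed before seeing the data, $Q$ is any posterior, $n$ is the sample size, $\delta \in (0,1)$ is the confidence parameter, the loss is assumed bounded in $[0,1]$, and $R$ and $\hat{R}_S$ denote the true and empirical risks. All symbols are assumptions introduced for this sketch.

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for all posteriors Q:
\mathbb{E}_{h \sim Q}\big[R(h)\big]
  \;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{R}_S(h)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

The $\mathrm{KL}(Q \,\|\, P)$ term is what connects a bound of this kind to the custom priors described in the abstract: a posterior that stays close to a well-chosen prior pays a smaller complexity penalty.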