Role Bias in Diffusion Models: Diagnosing and Mitigating through Intermediate Decomposition

Sina Malakouti, Adriana Kovashka

TL;DR

This work addresses directional role bias in action-based relations generated by text-to-image diffusion models, showing that rare prompts often collapse to their frequent counterparts (role collapse). It introduces RoleBench to quantify this phenomenon and proposes ReBind, a lightweight framework that decomposes rare relations into active and passive intermediate triplets generated by an LLM, generates images for those intermediates, and fine-tunes the model on them via LoRA to reinforce correct role bindings. The method demonstrates significant bias reduction (lower $\beta$) and improved alignment, with human evaluators preferring ReBind over multiple baselines, while maintaining performance on frequent compositions. The findings highlight distributional asymmetries as a key factor in compositional failures and offer a practical, low-overhead approach to improve generalization in creative image synthesis.

Abstract

Text-to-image (T2I) diffusion models exhibit impressive photorealistic image generation capabilities, yet they struggle in compositional image generation. In this work, we introduce RoleBench, a benchmark focused on evaluating compositional generalization in action-based relations (e.g., "mouse chasing cat"). We show that state-of-the-art T2I models and compositional generation methods consistently default to frequent reversed relations (i.e., "cat chasing mouse"), a phenomenon we call role collapse. Related works attribute this to the model's architectural limitation or underrepresentation in the data. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"), suggesting that this limitation is also due to the presence of frequent counterparts rather than just the absence of rare compositions. Motivated by this, we hypothesize that directional decomposition can gradually mitigate role collapse. We test this via ReBind, a lightweight framework that teaches role bindings using carefully selected active/passive intermediate compositions. Experiments suggest that simple fine-tuning on intermediate compositions can significantly reduce role collapse, with human evaluators preferring ReBind over state-of-the-art methods more than 78% of the time. Our findings highlight the role of distributional asymmetries in compositional failures and offer a simple, effective path for improving generalization.
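To make the idea of directional decomposition concrete, the following is a minimal sketch, not the authors' implementation. It assumes a relation is a (subject, action, object) triplet, that an "active" intermediate keeps the rare subject in the acting role with a different object (as in "mouse chasing boy"), and that a "passive" intermediate keeps the rare object in the receiving role with a different subject. The passive definition, the `Triplet` class, the `decompose` helper, and the replacement entity "dog" are illustrative assumptions. In ReBind the intermediates are proposed by an LLM, whereas here they are supplied by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for an action-based relation."""
    subject: str
    action: str
    obj: str

    def prompt(self) -> str:
        # Render the triplet as a plain text prompt.
        return f"{self.subject} {self.action} {self.obj}"

    def reversed(self) -> "Triplet":
        # Swap the roles: this is the frequent counterpart of a rare relation.
        return Triplet(self.obj, self.action, self.subject)


def decompose(rare: Triplet, alt_subject: str, alt_object: str) -> dict:
    """Build one active and one passive intermediate (assumed definitions).

    active:  rare subject stays the actor, object is replaced.
    passive: rare object stays the receiver, subject is replaced.
    """
    return {
        "active": Triplet(rare.subject, rare.action, alt_object),
        "passive": Triplet(alt_subject, rare.action, rare.obj),
    }


if __name__ == "__main__":
    rare = Triplet("mouse", "chasing", "cat")
    parts = decompose(rare, alt_subject="dog", alt_object="boy")
    print("rare:     ", rare.prompt())
    print("frequent: ", rare.reversed().prompt())
    print("active:   ", parts["active"].prompt())
    print("passive:  ", parts["passive"].prompt())
```

Neither intermediate prompt has a frequent reversed counterpart competing with it, which is the property the abstract credits for making such compositions easier to generate than the rare target itself.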

Paper Structure

This paper contains 21 sections, 3 equations, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Directional bias in action generation. (a) Role collapse: T2I models reliably generate frequent compositions (e.g., "cat chasing mouse") but fail on rare cases (e.g., "mouse chasing cat"), defaulting to the frequent form. (b) Intermediate compositions (e.g., "mouse chasing boy") can be used to enable models to correctly depict rare compositions. Colors: correct / incorrect generation.
  • Figure 2: Overview of ReBind. Our method enhances rare composition generation by introducing structured intermediate steps: (1) Role Decomposition via LLM-generated active/passive triplets to enforce correct role binding, and (2) Intermediate Image Generation using spatially-aware prompts. These images are later used for LoRA fine-tuning to mitigate role collapse.
  • Figure 3: ReBind is more effective than compositional generation methods. Human evaluators significantly prefer ReBind over baselines in head-to-head comparisons. Numbers in %.
  • Figure 4: ReBind maintains negative $\beta$ on frequent compositions. $\beta$ scores in %; FE: Facial Expression, VS: VQAScore.
  • Figure 5: Qualitative comparison of ReBind. Baselines often collapse to frequent compositions, while ReBind better captures rare compositions. Top to bottom: mouse chasing cat, horse riding astronaut, boy feeding woman, scientist carrying fireman, monkey lifting zoo trainer.
  • ...and 8 more figures