Role Bias in Diffusion Models: Diagnosing and Mitigating through Intermediate Decomposition
Sina Malakouti, Adriana Kovashka
TL;DR
This work addresses directional role bias in action-based relations generated by text-to-image diffusion models, showing that prompts for rare relations often collapse to their frequent counterparts (role collapse). It introduces RoleBench to quantify this phenomenon and proposes ReBind, a lightweight framework that decomposes rare relations into active and passive intermediate triplets generated by an LLM, then uses them for LoRA fine-tuning to reinforce correct role bindings. The method significantly reduces bias (lower $\beta$) and improves alignment, with human evaluators preferring ReBind over multiple baselines, while maintaining performance on frequent compositions. The findings highlight distributional asymmetries as a key factor in compositional failures and offer a practical, low-overhead approach to improving generalization in creative image synthesis.
Abstract
Text-to-image (T2I) diffusion models exhibit impressive photorealistic image generation capabilities, yet they struggle with compositional image generation. In this work, we introduce RoleBench, a benchmark focused on evaluating compositional generalization in action-based relations (e.g., "mouse chasing cat"). We show that state-of-the-art T2I models and compositional generation methods consistently default to the frequent reversed relation (e.g., "cat chasing mouse"), a phenomenon we call role collapse. Prior work attributes this to architectural limitations of the model or to underrepresentation in the data. Our key insight is that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"), suggesting that this limitation is also due to the presence of frequent counterparts rather than just the absence of rare compositions. Motivated by this, we hypothesize that directional decomposition can gradually mitigate role collapse. We test this via ReBind, a lightweight framework that teaches role bindings using carefully selected active/passive intermediate compositions. Experiments suggest that simple fine-tuning on intermediate compositions can significantly reduce role collapse, with human evaluators preferring ReBind over state-of-the-art methods more than 78% of the time. Our findings highlight the role of distributional asymmetries in compositional failures and offer a simple, effective path for improving generalization.
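
The directional decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Triplet` type, the `decompose` function, and the `active_partner` / `passive_partner` parameters are hypothetical names introduced here. The sketch assumes that an "active" intermediate keeps the rare subject in its acting role and that a "passive" intermediate keeps the rare object in its receiving role. In ReBind the intermediate entities are selected by an LLM; here they are simply passed in by hand.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """A (subject, relation, object) composition, e.g. mouse / chasing / cat."""
    subject: str
    relation: str
    obj: str

    def prompt(self) -> str:
        # Render the triplet as a plain text prompt.
        return f"{self.subject} {self.relation} {self.obj}"


def decompose(rare: Triplet, active_partner: str, passive_partner: str) -> dict[str, Triplet]:
    """Build two intermediate triplets from a rare one (illustrative assumption).

    active:  the rare subject keeps the acting role with a substitute object.
    passive: the rare object keeps the receiving role with a substitute subject.
    """
    return {
        "active": Triplet(rare.subject, rare.relation, active_partner),
        "passive": Triplet(passive_partner, rare.relation, rare.obj),
    }


rare = Triplet("mouse", "chasing", "cat")
# "boy" is the intermediate entity used in the abstract's example; reusing it
# for the passive side is an assumption made only for this illustration.
intermediates = decompose(rare, active_partner="boy", passive_partner="boy")

for role, triplet in intermediates.items():
    print(f"{role}: {triplet.prompt()}")
```

The active prompt reproduces the abstract's example, "mouse chasing boy". How the substitute entities are chosen, and how the resulting prompts enter fine-tuning, is left to the method itself.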
