VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim
TL;DR
VerbDiff targets a persistent weakness of text-to-image diffusion models: depicting human–object interactions accurately. It disentangles interaction verbs from anchor objects and emphasizes localized interaction cues. Relation Disentanglement Guidance (RDG) reduces verb–object bias using a frequency-based anchor and a triplet/image-alignment loss. Interaction Direction Guidance (IDG) uses an Interaction Region (IR) module that extracts region centroids from cross-attention maps to guide region-focused generation. Training updates only the cross-attention layers, combining a reconstruction objective with the RDG and IDG losses and balancing the long-tailed verb distribution with an adaptive effective number $\alpha(k)$. On HICO-DET, across metrics including CLIP and S-BERT similarities, HOI accuracy, and VQA-style scores, VerbDiff outperforms prior text-only and grounding-based methods in interaction fidelity while maintaining high image quality. The approach generates nuanced human–object interactions without extra conditioning inputs.
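To make the idea of a region centroid taken from a cross-attention map concrete, here is a minimal sketch in plain Python. It assumes the map is a small 2D grid of non-negative attention weights and computes the attention-weighted mean position. The function name `attention_centroid` and the choice of a weighted mean are illustrative assumptions, not the paper's IR module, which may define its centroids differently.

```python
from typing import Sequence, Tuple


def attention_centroid(attn: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return the attention-weighted (row, col) centroid of a 2D map.

    Hypothetical helper for illustration only: each cell's weight is
    treated as mass, and the centroid is the mass-weighted mean index.
    """
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total weight")

    # Weighted mean of row indices and of column indices.
    row_c = sum(i * v for i, row in enumerate(attn) for v in row) / total
    col_c = sum(j * v for row in attn for j, v in enumerate(row)) / total
    return row_c, col_c


# Example: all of the weight sits in the bottom-right cell of a 2x2 map.
centroid = attention_centroid([[0.0, 0.0], [0.0, 2.0]])
print(centroid)  # (1.0, 1.0)
```

In a real pipeline the map would come from a cross-attention layer for a specific token (for example the verb), typically after averaging over heads and resizing to a common resolution; those steps are omitted here.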
Abstract
Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
