VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim

TL;DR

VerbDiff targets accurate modeling of human–object interactions in text-to-image diffusion by disentangling interaction verbs from anchor objects and emphasizing localized interaction cues. It introduces Relation Disentanglement Guidance (RDG), which reduces verb–object bias via a frequency-based anchor and a triplet/image-alignment loss, and Interaction Direction Guidance (IDG), which uses an Interaction Region (IR) module that extracts region centroids from cross-attention maps to guide region-focused generation. Training optimizes only the cross-attention layers with a reconstruction objective plus the RDG and IDG losses, and balances the long-tail verb distribution with an adaptive effective number $\alpha(k)$. On HICO-DET, evaluated with CLIP/S-BERT similarities, HOI accuracy, and VQA-style scores, VerbDiff outperforms prior text-only and grounding-based methods in interaction fidelity while maintaining high image quality. The approach generates nuanced human–object interactions from text alone, without extra conditions such as grounding boxes.
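To make the "region centroids from cross-attention maps" step more concrete, here is a minimal sketch of one plausible reading: an attention-weighted centroid of a single 2D map. The function name `attention_centroid` and the plain nested-list input are assumptions for illustration only; the summary does not specify how the IR module computes centroids, so this should not be read as the authors' implementation.

```python
from typing import List, Tuple


def attention_centroid(attn: List[List[float]]) -> Tuple[float, float]:
    """Return the attention-weighted (row, col) centroid of a 2D map.

    Hypothetical helper: `attn` stands in for one token's cross-attention
    map, given as non-negative weights on a spatial grid.
    """
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    # Weighted mean of the row index and of the column index.
    row_c = sum(i * v for i, row in enumerate(attn) for v in row) / total
    col_c = sum(j * v for row in attn for j, v in enumerate(row)) / total
    return row_c, col_c


# A map whose mass sits on one cell has its centroid on that cell.
toy_map = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
center = attention_centroid(toy_map)
```

A centroid like this could then serve as the center of a cropped interaction region, although the crop size and the choice of which tokens' maps to use are details this summary does not give.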

Abstract

Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
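The abstract mentions "frequency-based anchor words" without defining them. As a rough sketch, one way to read this is to take, for each object, the interaction word that co-occurs with it most often in the training annotations. The function `anchor_verbs` and the `(verb, object)` tuple format are hypothetical; the paper may define anchors differently (for example per human-object pair).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def anchor_verbs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each object to its most frequent co-occurring verb.

    Assumption: `pairs` is an iterable of (verb, object) annotations,
    e.g. drawn from an HOI dataset. Ties go to the verb seen first.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for verb, obj in pairs:
        counts[obj][verb] += 1
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}


# Toy annotations, invented purely for illustration.
toy_pairs = [
    ("ride", "bicycle"), ("ride", "bicycle"), ("hold", "bicycle"),
    ("hold", "cup"), ("wash", "cup"), ("hold", "cup"),
]
anchors = anchor_verbs(toy_pairs)
```

Under this reading, less frequent verbs such as "wash" for "cup" would be the ones pushed away from the anchor "hold" during training.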

Paper Structure

This paper contains 29 sections, 11 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Generated samples illustrating multiple human-object interactions. Each color represents distinct humans, objects, and interaction words. GLIGEN (Li et al., 2023) and InteractDiffusion (Hoe et al., 2024) use grounding boxes as additional conditions, whereas Stable Diffusion (Rombach et al., 2022) and VerbDiff rely solely on text.
  • Figure 2: Examples of interactions from InteractDiffusion (Hoe et al., 2024), showing a lack of understanding of interaction words. The model relies on precise bounding boxes rather than understanding interaction words to exhibit accurate interactions, as shown in the results for the “red” and “blue” boxes.
  • Figure 3: Pipeline of VerbDiff. VerbDiff uses Relation Disentanglement Guidance (RDG), which separates the interaction features from the anchor text for each human-object pair. It also contains the IR module (right), which extracts localized interaction regions from generated images without explicit bounding boxes, and Interaction Direction Guidance (IDG), which guides the model to focus more on the fine-grained interaction regions.
  • Figure 4: Interaction comparison in generated images with other models. We generate images using a fixed template: "A {H} {R} a/an {O}". The top row displays the input human, interaction word, and object. Green and blue boxes represent additional grounding boxes used during image generation for each human and object, respectively. Our results produce images with more accurate interactions than other models, closely resembling ground-truth interactions while maintaining high image quality comparable to large-scale generative models like DALL·E 3 (Betker et al., 2023).
  • Figure 5: Image interaction comparison with and without guidance. IDG focuses on the specific localized interaction region (red box) in the image.
  • ...and 8 more figures
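
The fixed prompt template quoted in the Figure 4 caption, "A {H} {R} a/an {O}", can be sketched as a small helper. The function name `build_prompt` is hypothetical, and the article choice uses a simple first-letter vowel heuristic, which is an assumption here and will be wrong for words such as "hour" or "unicorn".

```python
def build_prompt(human: str, relation: str, obj: str) -> str:
    """Fill the template "A {H} {R} a/an {O}".

    Assumption: the article is picked by a first-letter vowel check,
    a rough heuristic rather than a full pronunciation rule.
    """
    if not obj:
        raise ValueError("object must be a non-empty string")
    article = "an" if obj[0].lower() in "aeiou" else "a"
    return f"A {human} {relation} {article} {obj}"


prompt_a = build_prompt("person", "riding", "bicycle")
prompt_an = build_prompt("person", "feeding", "elephant")
```

Keeping the template fixed means that only the human, interaction word, and object vary between prompts, so differences in the generated images can be attributed to those three slots.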