Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Jeonghyeon Na, Sangwon Baik, Inhee Lee, Junyoung Lee, Hanbyul Joo

TL;DR

This work defines Human-Human-Object Interactions (HHOIs) and develops a text-conditioned diffusion framework that jointly models HOI and HHI to generate coherent multi-person interactions around shared objects. It decouples HOI and HHI into two diffusion models trained with denoising score matching and employs guided sampling with inconsistency and collision losses to ensure consistency and plausibility across multiple humans. A new multi-view HHOI dataset, integration with CORE4D, and synthetic data pipelines enable robust training and evaluation for dyadic and multi-human interactions, with demonstrations of improved realism and semantic alignment over baselines. The approach enables interaction-aware multi-human motion generation, advancing embodied AI capabilities in socially complex scenes, while highlighting avenues for future dataset integration and model refinement.
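The guided sampling described above can be illustrated with a small toy. This is only a sketch under stated assumptions, not the authors' implementation: each human is reduced to a 2D point, the two "score" functions are analytic stand-ins for the learned HOI and HHI diffusion models, the update is noise-free and ignores text conditioning and noise levels, and the loss forms (squared disagreement for inconsistency, squared penetration depth for collision) as well as every name and parameter (`guided_step`, `w_inc`, `w_col`, `min_dist`) are hypothetical choices made for illustration.

```python
import math

def inconsistency_loss(x_a, x_b):
    """Mean squared gap between two estimates of the same humans (lists of 2D points)."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(x_a, x_b)) / len(x_a)

def inconsistency_grad(x_a, x_b):
    """Gradient of inconsistency_loss(x_a, x_b) with respect to x_a."""
    n = len(x_a)
    return [(2.0 * (ax - bx) / n, 2.0 * (ay - by) / n)
            for (ax, ay), (bx, by) in zip(x_a, x_b)]

def collision_loss(x, min_dist=0.5):
    """Sum of squared penetration depths over all pairs closer than min_dist (assumed form)."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            d = math.hypot(x[i][0] - x[j][0], x[i][1] - x[j][1])
            total += max(0.0, min_dist - d) ** 2
    return total

def collision_grad(x, min_dist=0.5, eps=1e-8):
    """Gradient of collision_loss with respect to every point."""
    grad = []
    for i, (xi, yi) in enumerate(x):
        gx = gy = 0.0
        for j, (xj, yj) in enumerate(x):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            pen = max(0.0, min_dist - d)
            # d/dx_i of pen^2 is -2 * pen * (x_i - x_j) / d
            gx -= 2.0 * pen * (xi - xj) / (d + eps)
            gy -= 2.0 * pen * (yi - yj) / (d + eps)
        grad.append((gx, gy))
    return grad

def score_hoi(x):
    """Toy stand-in for an HOI score: pull each human onto a unit ring around an object at the origin."""
    out = []
    for px, py in x:
        r = math.hypot(px, py)
        scale = -(r - 1.0) / (r + 1e-8)
        out.append((scale * px, scale * py))
    return out

def score_hhi(x):
    """Toy stand-in for an HHI score: pull humans toward their common centroid."""
    cx = sum(p[0] for p in x) / len(x)
    cy = sum(p[1] for p in x) / len(x)
    return [(cx - px, cy - py) for px, py in x]

def _update(x, score, g_inc, g_col, step, w_inc, w_col):
    """Move along the score minus the weighted guidance gradients."""
    return [(px + step * (sx - w_inc * ix - w_col * cx),
             py + step * (sy - w_inc * iy - w_col * cy))
            for (px, py), (sx, sy), (ix, iy), (cx, cy) in zip(x, score, g_inc, g_col)]

def guided_step(x_hoi, x_hhi, step=0.05, w_inc=5.0, w_col=1.0):
    """One noise-free guided update of the HOI sample and the HHI sample."""
    new_hoi = _update(x_hoi, score_hoi(x_hoi), inconsistency_grad(x_hoi, x_hhi),
                      collision_grad(x_hoi), step, w_inc, w_col)
    new_hhi = _update(x_hhi, score_hhi(x_hhi), inconsistency_grad(x_hhi, x_hoi),
                      collision_grad(x_hhi), step, w_inc, w_col)
    return new_hoi, new_hhi

def run_demo(steps=200):
    """Run the toy sampler and report inconsistency before and after."""
    x_hoi = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    x_hhi = [(2.0, 0.5), (0.5, 2.0), (-2.0, -0.5)]
    before = inconsistency_loss(x_hoi, x_hhi)
    for _ in range(steps):
        x_hoi, x_hhi = guided_step(x_hoi, x_hhi)
    return before, inconsistency_loss(x_hoi, x_hhi), collision_loss(x_hoi)
```

Calling `run_demo()` returns the inconsistency before and after sampling plus the final collision loss of the HOI sample. The point of the design is that each model keeps its own sample and its own score, while the shared guidance terms pull the two samples toward agreement and push overlapping humans apart.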

Abstract

The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem: modeling the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train a text-to-HOI model and a text-to-HHI model using score-based diffusion models. Finally, we present a unified generative framework that integrates the two individual models and is capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

Paper Structure

This paper contains 26 sections, 16 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Results of our HHOI generation given object instances and text-prompt descriptions. Multiple humans in action are generated by jointly enforcing scene-level consistency across human-object interactions (HOIs) and human-human interactions (HHIs).
  • Figure 2: Method Overview. (a) The training and inference process of the HOI/HHI part. (b) The advanced HHOI sampling process, which introduces an inconsistency loss and a collision loss.
  • Figure 3: HHOI generation results for dyadic and multi-human interactions, from our model and baselines. In the multi-human HHOI setting, the number of humans ranges from 3 to 5. An empty result represents a case where generation failed in 10 trials. Our model can generate complex HHOIs with a varying number of humans in the scene while preserving natural social cues.
  • Figure 4: Motion in-betweening outputs from DNO and InterGen, given a naive standing pose as the start frame constraint and our HHOI generation output as the end frame constraint.
  • Figure 5: HHOIs Capture System Overview. We capture Human-Human-Object Interactions (HHOIs) with our multi-camera capture system. The object and human poses are tracked with ArUco markers [garrido2014automatic_aruco] and DWPose [dwpose], respectively.
  • ...and 8 more figures