Guiding Human-Object Interactions with Rich Geometry and Relations

Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, Changxing Ding

TL;DR

ROG narrows the fidelity gap in human-object interaction (HOI) synthesis by building rich object geometry into an Interactive Distance Field (IDF) and a diffusion-based relation model. It samples boundary-focused and Poisson-disk key points from the object mesh to form a dense object representation, which is paired with human skeleton key points and a motion diffusion backbone. A learned relation model with spatial and temporal attention refines the IDF, and the refined IDF then guides motion generation toward more realistic and semantically aligned results, achieving state-of-the-art results on FullBodyManipulation. Because the object is no longer reduced to a centroid or a single nearest point, the approach is relevant to VR, animation, and robotics.
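To make the Poisson-disk idea concrete, here is a minimal sketch of dart-throwing subsampling over a point set. It is an illustration only, not the authors' sampler: the function name `poisson_disk_subsample`, the `radius` parameter, and the use of random points in place of mesh vertices are all assumptions, and the boundary-focused selection is not covered.

```python
import numpy as np

def poisson_disk_subsample(points, radius, seed=0):
    """Greedy dart-throwing: keep a point only if it lies at least
    `radius` away from every point kept so far (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(points))  # visit candidates in random order
    kept = []
    for idx in order:
        candidate = points[idx]
        if all(np.linalg.norm(candidate - points[k]) >= radius for k in kept):
            kept.append(idx)
    return points[np.array(kept, dtype=int)]

# Usage: random points in the unit cube stand in for mesh surface samples.
demo_points = np.random.default_rng(1).random((200, 3))
demo_keypoints = poisson_disk_subsample(demo_points, radius=0.25)
```

The result is a subset whose points are mutually at least `radius` apart, which gives roughly even coverage. This brute-force version is quadratic in the worst case and would need a spatial grid for large meshes.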

Abstract

Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF) that robustly captures HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.
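As a rough illustration of what a distance field between human and object key points might look like, the sketch below computes per-frame pairwise Euclidean distances. The abstract does not give the exact IDF definition, so this formulation, the array shapes, and the name `interactive_distance_field` are assumptions made for illustration.

```python
import numpy as np

def interactive_distance_field(human_kp, object_kp):
    """Pairwise human-to-object key point distances per frame.

    human_kp:  (T, J, 3) array of J human key points over T frames (assumed layout)
    object_kp: (T, K, 3) array of K object key points over T frames (assumed layout)
    returns:   (T, J, K) array of Euclidean distances
    """
    # Broadcast to (T, J, K, 3) and take the norm over the xyz axis.
    diff = human_kp[:, :, None, :] - object_kp[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Usage: one frame, one human joint at the origin, two object key points.
demo_human = np.zeros((1, 1, 3))
demo_object = np.array([[[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]])
demo_idf = interactive_distance_field(demo_human, demo_object)
# demo_idf has shape (1, 1, 2) with distances 5.0 and 2.0
```

A representation of this kind grows with the number of object key points, which is one reason a compact but geometry-covering key point set matters.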

Paper Structure

This paper contains 45 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Novelty analysis. L2 distances of 512 test cases vs. top-3 training neighbors. Red lines mark intra-trainset distance, demonstrating generation beyond training data.
  • Figure 2: Overview of ROG. Given an object, ROG first extracts key points that comprehensively represent the object's geometry. These object key points, along with human key points, a text prompt, and the diffusion step $t$, are then input into ROG to generate human-object interactions that are semantically aligned with the text prompt. During each denoising step, the motion generation model initially produces movements $\tilde{\mathbf{m}}_0 = \{\tilde{\mathbf{m}}_\text{hm}, \tilde{\mathbf{m}}_\text{obj}\}$ for both the human and the object. The relation model then uses the Interactive Distance Field (IDF) $\mathbf{D}$ derived from these initial movements to output a refined IDF $\tilde{\mathbf{D}}$. This refined IDF guides the enhancement of the initial movements, improving their quality.
  • Figure 2: User study. We generated HOIs for 15 captions using 4 methods and asked 20 users to rank them by text alignment and realism. Our method outperforms others in both aspects.
  • Figure 3: Qualitative comparisons. We use circles to highlight incorrect interactions, illustrating that our method can generate more realistic and physically plausible interactions that align with the given text.
  • Figure 3: Our model generates semantically accurate and consistent human-object interactions across various objects.
  • ...and 3 more figures