Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang

TL;DR

The paper tackles the challenge of scalable and generalizable HOI synthesis by eliminating manual reward design and enabling long-horizon interaction with varied object types. It introduces a Vision-Language Model–driven Relative Movement Dynamics (RMD) planner that outputs structured plans over a bipartite human-object part graph, and an automatic policy learning pipeline that converts those plans into goal states and rewards for a physics-based controller. Key contributions include the RMD representation, the VLM-guided planner, automatic goal/reward construction, and the Interplay dataset, with experiments showing superior performance on both single-task and long-horizon multi-task HOI scenarios across static, dynamic, and articulated objects. The approach bridges high-level semantic reasoning with low-level motion control, enabling natural, transferable HOI motions for animation, simulation, and robotics, and opens avenues for diffusion-based motion diversity and multi-agent extensions.

Abstract

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.
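
To make the idea of turning a part-level plan into a reward more concrete, here is a minimal sketch of how one step of a bipartite human-part/object-part plan might be scored automatically. It is not the authors' implementation: the relation labels (`approach`, `contact`, `away`), the `Edge` class, the reward shapes, and the `scale` parameter are all assumptions made for illustration, and the paper's actual RMD vocabulary and reward terms may differ.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical relation labels (assumption, not the paper's vocabulary).
APPROACH, CONTACT, AWAY = "approach", "contact", "away"

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Edge:
    """One edge of an illustrative bipartite human-part / object-part graph."""
    human_part: str
    object_part: str
    relation: str


def edge_reward(edge: Edge, prev_d: float, curr_d: float, scale: float = 5.0) -> float:
    """Score one edge from the part-to-part distance before and after a step."""
    if edge.relation == CONTACT:
        # Highest (1.0) when the two parts coincide, decaying with distance.
        return math.exp(-scale * curr_d)
    if edge.relation == APPROACH:
        # Positive when the distance shrank during the step.
        return math.tanh(scale * (prev_d - curr_d))
    if edge.relation == AWAY:
        # Positive when the distance grew during the step.
        return math.tanh(scale * (curr_d - prev_d))
    raise ValueError(f"unknown relation: {edge.relation}")


def step_reward(edges: List[Edge], prev: Dict[str, Vec3], curr: Dict[str, Vec3]) -> float:
    """Average the per-edge rewards for one plan step (0.0 for an empty plan)."""
    if not edges:
        return 0.0
    total = 0.0
    for e in edges:
        prev_d = math.dist(prev[e.human_part], prev[e.object_part])
        curr_d = math.dist(curr[e.human_part], curr[e.object_part])
        total += edge_reward(e, prev_d, curr_d)
    return total / len(edges)


# Usage: the hand stays on a handle while the pelvis moves toward a seat.
plan = [Edge("right_hand", "handle", CONTACT), Edge("pelvis", "seat", APPROACH)]
prev = {"right_hand": (0.0, 0.0, 0.0), "handle": (0.0, 0.0, 0.0),
        "pelvis": (1.0, 0.0, 0.0), "seat": (0.0, 0.0, 0.0)}
curr = dict(prev, pelvis=(0.5, 0.0, 0.0))
print(step_reward(plan, prev, curr))
```

The point of the sketch is only that once a plan is expressed as relations between named parts, a reward can be computed from part positions alone, with no task-specific hand tuning. A real controller would also need goal-state encoding, contact detection, and motion-naturalness terms, none of which are modeled here.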

Paper Structure

This paper contains 31 sections, 16 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Our framework automatically constructs goal states and reward functions for a variety of interaction tasks in reinforcement learning. Guided by VLMs, the resulting motion policy enables physics-based characters to perform long-horizon interactions with both static and dynamic objects.
  • Figure 2: An overview of our architecture. Receiving instruction and environment context as input, the VLM-Guided RMD Planner generates a multi-step interaction plan in the form of RMD. Based on this plan, our framework automatically designs both goal states and reward functions, enabling the VLM-Guided Motion Policy to execute the interaction step by step.
  • Figure 3: Visualization for qualitative comparison. Other methods exhibit unnatural motion (InterPhys) or incomplete interactions (UniHSI), whereas our method demonstrates human-like motion quality in qualitative assessments. More qualitative visualization videos can be found in the supplementary materials.
  • Figure A1: An overview of VLM-guided RMD Planner pipeline.
  • Figure A2: Details of prompt section.
  • ...and 6 more figures