Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

Kaiyang Ji, Ye Shi, Zichen Jin, Kangyi Chen, Lan Xu, Yuexin Ma, Jingyi Yu, Jingya Wang

TL;DR

The paper tackles real-time, physically plausible interaction between humans and diverse agents (avatars, humanoids, robots) in immersive settings. It introduces Human-X, a real-time auto-regressive action-reaction diffusion framework coupled with an actor-aware tracking policy that targets safety, physical realism, and responsiveness. Combining diffusion-based reaction generation, reactor-centric representations, and a physics-tracking policy, the method reports better motion quality, continuity, and interaction realism than prior methods on the Inter-X and InterHuman datasets, and is demonstrated in VR and human-robot interface scenarios. Ablations and user evaluations support the effectiveness of the resulting low-latency, physics-consistent synthesis pipeline for practical human-machine collaboration.

Abstract

Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.

Paper Structure

This paper contains 64 sections, 11 equations, 5 figures, 9 tables, 2 algorithms.

Figures (5)

  • Figure 1: We propose Human-X, the first framework designed to enable latency-free interaction between humans and diverse entities, including human-avatar, human-humanoid, and human-robot interaction.
  • Figure 2: Overview of our immersive real-time interaction synthesis pipeline: (a) Actor Motion Capture: A human actor’s movements are recorded at 30 fps by an RGB-D camera and translated into 3D poses, which are then retargeted to a humanoid character. (b) Realistic Reactor Motion Generation: An auto-regressive diffusion model, guided by optional text prompts (e.g., “Dancing is what to do”), generates plausible reaction motions. These motions are tracked by an actor-aware controller, which uses proprioception signals to ensure realistic, synchronized interactions. (c) Real-time VR Interface: The generated and tracked motions are rendered in a simulator, providing both a third-person view and a binocular VR view.
  • Figure 3: Compared to CAMDM (top row), Human-X (bottom row) achieves more complete hand contact in tasks such as face-hitting and handshaking. Additionally, its foot movement appears more natural, as highlighted in the red and green circles.
  • Figure 4: Visualization results on Human-Robot Interaction. The robot (black skeleton) and human (orange mesh) perform a handshake on a flat plane, from arm extension and palm contact through to shake completion, illustrating our method’s spatial coordination and motion coherence.
  • Figure 5: Extensive experimental results indicate that participants perceive our method to perform better in all three metrics: Diversity, Consistency, and Authenticity.
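
The control flow described in the Figure 2 caption (capture an actor frame, plan a reaction auto-regressively, then track it) can be illustrated with a toy loop. This is only a sketch under stated assumptions: `rollout`, `toy_plan`, `toy_track`, the flat `Pose` vector, and `history_len` are hypothetical names introduced here, and the two stand-in functions are simple placeholders, not the paper's diffusion planner or its reinforcement-learning tracking policy. Only the 30 fps capture rate comes from the caption.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Pose = List[float]  # hypothetical flat pose vector, not the paper's representation

FPS = 30                    # capture rate stated in the Figure 2 caption
FRAME_BUDGET_S = 1.0 / FPS  # time available per frame for planning plus tracking


def rollout(
    actor_frames: Sequence[Pose],
    plan: Callable[[Sequence[Pose], Sequence[Pose]], Pose],
    track: Callable[[Pose, Pose], Pose],
    init_pose: Pose,
    history_len: int = 4,
) -> List[Pose]:
    """Auto-regressive loop: each reactor frame is conditioned on past outputs."""
    actor_hist: Deque[Pose] = deque(maxlen=history_len)
    reactor_hist: Deque[Pose] = deque([init_pose], maxlen=history_len)
    out: List[Pose] = []
    for actor_pose in actor_frames:
        actor_hist.append(actor_pose)
        # Planner stand-in: proposes a target pose from both histories.
        target = plan(list(actor_hist), list(reactor_hist))
        # Tracker stand-in: turns the target into the pose actually executed.
        pose = track(reactor_hist[-1], target)
        reactor_hist.append(pose)  # feed the executed pose back as context
        out.append(pose)
    return out


def toy_plan(actor_hist: Sequence[Pose], reactor_hist: Sequence[Pose]) -> Pose:
    # Placeholder for the diffusion planner: aim at the latest actor pose.
    return list(actor_hist[-1])


def toy_track(current: Pose, target: Pose, max_step: float = 0.5) -> Pose:
    # Placeholder for the tracking policy: cap the per-frame change per coordinate.
    return [c + max(-max_step, min(max_step, t - c)) for c, t in zip(current, target)]


actor = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
reactor = rollout(actor, toy_plan, toy_track, init_pose=[0.0, 0.0])
print(reactor)
```

The point of the sketch is the feedback structure: the executed pose, not the planned one, is appended to the reactor history, so any correction made by the tracker influences the next planning step. At 30 fps the planner and tracker together would have roughly 33 ms per frame.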