InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui

TL;DR

InterMimic tackles the challenge of learning realistic, physics-based full-body human-object interactions from imperfect MoCap data by introducing a curriculum-driven teacher-student framework. Subject-specific teacher policies refine and retarget demonstrations, and their knowledge is then distilled into a scalable Transformer-based student policy that is RL-fine-tuned for broad generalization, including zero-shot performance with unseen objects and integration with kinematic generators. The approach leverages contact-guided rewards, physical state initialization, and interaction-aware termination to address MoCap artifacts and achieve stable, diverse HOI skills across dynamic objects. This work moves from pure imitation toward generative modeling of HOI by enabling scalable skill learning, retargeting, and cross-domain generation, with practical implications for humanoid robots and future text-to-HOI or interaction-prediction systems.

Abstract

Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
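
The "perfect first, then scale up" curriculum described above can be illustrated with a minimal sketch of how a student objective might shift from teacher supervision toward RL fine-tuning. Everything here is an assumption for illustration only: the function names, the linear annealing schedule, and the mean-squared-error imitation term are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: blend a teacher-supervision loss with an RL loss.
# The linear schedule and the MSE imitation term are assumptions,
# not the authors' stated method.

def distillation_weight(step, warmup_steps, anneal_steps):
    """Weight on the teacher-supervision term.

    Stays at 1.0 during warmup, then decays linearly to 0.0
    over `anneal_steps` further steps.
    """
    if step < warmup_steps:
        return 1.0
    progress = (step - warmup_steps) / anneal_steps
    return max(0.0, 1.0 - progress)


def student_loss(student_action, teacher_action, rl_loss,
                 step, warmup_steps, anneal_steps):
    """Convex combination of an imitation loss and a given RL loss."""
    w = distillation_weight(step, warmup_steps, anneal_steps)
    # Mean squared error between student and (frozen) teacher actions.
    imitation = sum((s - t) ** 2
                    for s, t in zip(student_action, teacher_action))
    imitation /= len(student_action)
    return w * imitation + (1.0 - w) * rl_loss
```

Early in training the student is driven only by the teacher's actions; as the weight decays, the RL term takes over, which is one simple way a policy could move past replicating demonstrations.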

Paper Structure

This paper contains 30 sections, 5 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: InterMimic enables physically simulated humans to perform interactions with dynamic and diverse objects. It supports highly dynamic, multi-object interactions and scalable skill learning (Top), making it adaptable for versatile downstream applications (Bottom): it can translate whole-body loco-manipulation skills to a humanoid robot [unitreeg1inspire], perfect interaction MoCap data, and bridge kinematic generation, e.g., predicting future interactions from past (InterDiff [xu2023interdiff]) or generating interactions given text prompts (InterDreamer [xu2024interdreamer]).
  • Figure 2: Our two-stage pipeline: (i) training each teacher policy (MLP) on a small data subset with initialization corrected via Physical State Initialization (PSI), and (ii) freezing the teacher policies to provide refined references for training a student policy (Transformer). The student leverages teacher supervision for effective scaling and is fine-tuned through RL.
  • Figure 3: (i) Visualization of reference contact markers that accommodate varied contact distances: red to promote contact, green for neutral areas where contact is neither promoted nor penalized, and blue to penalize contact. (ii) Initializing the rollout with reference (RSI) or reference corrected via simulation (PSI).
  • Figure 4: Qualitative comparison between PhysHOI [wang2023physhoi] (top), the reference motion (middle) from the BEHAVE [bhatnagar22behave] dataset, and the interaction refined by our teacher trained on it (bottom). InterMimic faithfully imitates the interactions involving multiple body parts while correcting errors in the original reference.
  • Figure 5: We recover plausible object rotations (bottom) that are challenging for motion capture due to the equivariant geometries of objects, which result in the object sliding on the ground (top).
  • ...and 9 more figures
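
The three-way contact markers in the Figure 3 caption (red promotes contact, green is neutral, blue penalizes contact) can be sketched as a simple reward term. This is a hypothetical illustration: the distance thresholds `near` and `far`, the labeling rule, and the exponential reward form are assumptions and are not given in the source.

```python
import math

# Hypothetical marker labels mirroring the caption's colors.
RED, GREEN, BLUE = 1, 0, -1


def label_marker(ref_distance, near=0.02, far=0.10):
    """Label a marker from its reference distance to the object.

    `near` and `far` (meters) are assumed thresholds: close markers
    promote contact, distant ones penalize it, and the band in
    between is neutral to tolerate imprecise MoCap contacts.
    """
    if ref_distance <= near:
        return RED
    if ref_distance >= far:
        return BLUE
    return GREEN


def contact_reward(labels, sim_contacts, scale=1.0):
    """Reward that decays with the number of contact mismatches.

    A RED marker without simulated contact, or a BLUE marker with
    contact, counts as a mismatch; GREEN markers are ignored.
    """
    mismatches = 0
    for label, in_contact in zip(labels, sim_contacts):
        if label == RED and not in_contact:
            mismatches += 1
        elif label == BLUE and in_contact:
            mismatches += 1
    return math.exp(-scale * mismatches)
```

The neutral band is the point of the design: a marker whose reference distance is ambiguous neither earns nor costs reward, so noisy contact annotations do not pull the policy toward wrong contacts.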