LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations

Yutang Lin, Jieming Cui, Yixuan Li, Baoxiong Jia, Yixin Zhu, Siyuan Huang

TL;DR

This work shows that Distance Field (DF) provides a unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy, and offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.

Abstract

Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues (surface distances, gradients, and velocity decompositions), removing the need for motion references, with interaction latents encoded via a Variational Auto-Encoder (VAE) and post-trained using Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture (MoCap) infrastructure. A single LessMimic policy achieves 80-100% success across object scales from 0.4x to 1.6x on PickUp and SitStand where baselines degrade sharply, attains 62.1% success on trajectories composed of 5 task instances, and remains viable up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
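
To make the DF-derived cues named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes an analytic sphere as the object and treats "velocity decomposition" as splitting each link velocity into a component along the DF gradient and a component tangent to the surface; the paper's exact feature layout may differ. The function names `sphere_df`, `decompose_velocity`, and `df_features` are hypothetical.

```python
import numpy as np


def sphere_df(points, center, radius):
    """Signed distance and unit gradient of a sphere's distance field.

    points: (N, 3) query positions, e.g. humanoid link positions (assumed).
    Returns distances of shape (N,) and gradients of shape (N, 3).
    """
    offsets = points - center
    norms = np.linalg.norm(offsets, axis=-1, keepdims=True)
    distance = norms[..., 0] - radius
    # Guard against division by zero at the exact sphere center.
    gradient = offsets / np.maximum(norms, 1e-9)
    return distance, gradient


def decompose_velocity(velocity, gradient):
    """Split velocity into parts along and orthogonal to the DF gradient."""
    normal_speed = np.sum(velocity * gradient, axis=-1, keepdims=True)
    v_normal = normal_speed * gradient   # moving toward or away from the surface
    v_tangent = velocity - v_normal      # sliding along the surface
    return v_normal, v_tangent


def df_features(points, velocities, center, radius):
    """Stack distance, gradient, and decomposed velocity per query point."""
    distance, gradient = sphere_df(points, center, radius)
    v_normal, v_tangent = decompose_velocity(velocities, gradient)
    return np.concatenate(
        [distance[:, None], gradient, v_normal, v_tangent], axis=-1
    )
```

Under these assumptions each query point yields a 10-dimensional feature (1 distance, 3 gradient, 3 normal velocity, 3 tangential velocity). Because the features depend only on local geometry relative to the surface, the same computation applies unchanged when the object is rescaled or swapped for another shape with a known distance field.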

Paper Structure (26 sections, 12 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 0: Generalizable long-horizon humanoid interaction via LessMimic. A single DF-conditioned policy supports (a) online failure recovery through continuous geometric feedback, (b) generalization to unseen object shapes and scales without retraining, and (c) long-horizon composition of heterogeneous interaction skills within a single policy.
  • Figure 1: LessMimic framework overview. The policy takes as input a root trajectory command, humanoid proprioception, and a unified DF-based interaction representation that captures the current spatial and temporal relations between the humanoid and the object. The representation is constructed from MoCap or depth images and encoded into a compact latent $z_t$ via a VAE. The policy is trained in two stages (interaction skill pre-training and discriminative post-training) and outputs actions to a whole-body controller at deployment.
  • Figure 2: Training pipeline of LessMimic. (a) Object observations from either MoCap or an egocentric depth camera yield per-link DF features $\mathbf{u}_t$, which are velocity-decomposed and encoded into an interaction latent representation $z_t$ via a VAE. (b) During interaction skill pre-training, a teacher policy $\pi_{\text{mimic}}$ tracks retargeted human motions to generate physically valid data, from which $\pi_{\text{base}}$ is trained via behavior cloning. $\pi_{\text{base}}$ has no access to reference motions; it instead takes the root trajectory command $c^{\text{root}}_t$, humanoid proprioception $o^{\text{prop}}$, and the DF latent $z_t$ as input. (c) During discriminative post-training, $\pi_{\text{base}}$ is fine-tuned with RL guided by AIP, a discriminator that regularizes interaction validity in the geometric domain across randomized object geometries, yielding $\pi_{\text{full}}$. (d) During visual-motor distillation, $\pi_{\text{full}}$ is distilled into $\pi_{\text{vis}}$ via DAgger-style supervision, replacing MoCap inputs with egocentric depth features for portable real-world deployment.
  • Figure 3: An example of the DF signals during a sitting interaction. The blue curve shows the mean DF distance between the humanoid and the chair across all joints, with the shaded region indicating the full range; the red curve shows the mean DF gradient magnitude. As the humanoid approaches and makes contact with the chair, the distance decreases while the gradient magnitude increases, reflecting the intensifying geometric coupling. Vertical dashed lines indicate transitions between interaction phases (Stand, Sit, Sit, Stand).
  • Figure 4: Real-world generalization of LessMimic. (a) The policy successfully picks up a box, one of the training geometries. (b) The same policy generalizes to a soccer ball, a spherical object entirely unseen during training, demonstrating shape generalization beyond the training distribution. (c) The policy performs SitStand across two chair heights ($12\,\mathrm{cm}$ and $46\,\mathrm{cm}$), maintaining stable pelvis contact across diverse seat geometries.
  • ...and 1 more figures
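
The DAgger-style distillation described in the Figure 2(d) caption can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes a linear teacher acting on a privileged state (standing in for the DF latent), a linear student acting on a different view of that state (standing in for depth features), simple additive dynamics, and least-squares fitting in place of gradient-based training. All names (`rollout`, `dagger`, `teacher_W`, `obs_A`) are hypothetical.

```python
import numpy as np


def rollout(policy, obs_fn, s0, n_steps, rng):
    """Run `policy` on observations and return the privileged states visited."""
    states, s = [], s0.copy()
    for _ in range(n_steps):
        states.append(s.copy())
        action = policy(obs_fn(s))
        # Assumed toy dynamics: small action step plus process noise.
        s = s + 0.1 * action + 0.01 * rng.standard_normal(s.shape)
    return np.array(states)


def dagger(teacher_W, obs_A, n_iters=4, n_steps=50, seed=0):
    """Distill a privileged linear teacher into a student that sees obs_A @ s."""
    rng = np.random.default_rng(seed)
    obs_fn = lambda s: obs_A @ s       # student's view of the state
    teacher = lambda s: teacher_W @ s  # teacher uses the privileged state
    theta = np.zeros((teacher_W.shape[0], obs_A.shape[0]))  # student weights
    data_obs, data_act = [], []
    for _ in range(n_iters):
        student = lambda o, th=theta: th @ o
        s0 = rng.standard_normal(obs_A.shape[1])
        # Key DAgger step: visit states under the *student's* own behavior...
        states = rollout(student, obs_fn, s0, n_steps, rng)
        # ...but label them with the *teacher's* actions, then aggregate.
        data_obs.extend(obs_fn(s) for s in states)
        data_act.extend(teacher(s) for s in states)
        obs_mat, act_mat = np.array(data_obs), np.array(data_act)
        # Refit the student on the whole aggregated dataset.
        theta = np.linalg.lstsq(obs_mat, act_mat, rcond=None)[0].T
    return theta
```

In this linear toy setting, with an invertible `obs_A`, the student can reproduce the teacher exactly, so the fitted weights approach `teacher_W @ inv(obs_A)`. The point of the sketch is the data-collection pattern: the student's own rollouts determine which states are labeled, which is what distinguishes DAgger-style supervision from plain behavior cloning on teacher rollouts.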