ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors

Wanyue Zhang, Rishabh Dabral, Thomas Leimkühler, Vladislav Golyanik, Marc Habermann, Christian Theobalt

TL;DR

ROAM tackles the challenge of producing realistic 3D character motion with scene interactions that generalise to unseen objects. It decouples goal-pose estimation from motion generation: an SE(3)-equivariant Neural Descriptor Field around the target object and a Skeletal Pose Descriptor transfer pose information from a reference object, and the goal pose is optimised via a descriptor-based energy $E_d$ together with pose priors. This goal-pose synthesis (GPS) stage is followed by a lightweight Neural State Machine (l-NSM) with bidirectional pose blending, which generates seamless, realistic motion from standing to interaction poses and is trained with only a single exemplar per object category. Empirical results (perceptual studies, quantitative pose errors, and ablations) demonstrate robust generalisation to unseen chair and sofa geometries and superior motion quality compared to state-of-the-art methods, enabling scalable, data-efficient object-aware animation for virtual environments.

Abstract

Existing automatic approaches for 3D virtual character motion synthesis supporting scene interactions do not generalise well to new objects outside training distributions, even when trained on extensive motion capture datasets with diverse objects and annotated interactions. This paper addresses this limitation and shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object. We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object. Given an unseen object and a reference pose-object pair, we optimise for the object-aware pose that is closest in the feature space to the reference pose. Finally, we use l-NSM, i.e., our motion generation model that is trained to seamlessly transition from locomotion to object interaction with the proposed bidirectional pose blending scheme. Through comprehensive numerical comparisons to state-of-the-art methods and in a user study, we demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects. Our project page is available at https://vcai.mpi-inf.mpg.de/projects/ROAM/.
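
The descriptor-matching step described in the abstract can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the learned descriptor field is replaced by a hand-crafted stand-in (joint-to-keypoint distances, which do not change when object and pose are moved together by one rigid transform), the two objects are assumed to share keypoints in correspondence, the pose priors are omitted, and the names `descriptor`, `energy`, `energy_grad`, and `optimise_goal_pose` are hypothetical.

```python
import numpy as np

def descriptor(obj_pts, joints):
    """Toy stand-in for Z(O, P): distance from every joint to every object keypoint."""
    diff = joints[:, None, :] - obj_pts[None, :, :]   # (J, K, 3)
    return np.linalg.norm(diff, axis=-1)              # (J, K)

def energy(obj_pts, joints, target_desc):
    """Toy descriptor energy: half the squared gap to the reference descriptors."""
    return 0.5 * np.sum((descriptor(obj_pts, joints) - target_desc) ** 2)

def energy_grad(obj_pts, joints, target_desc, eps=1e-9):
    """Analytic gradient of the toy energy with respect to the joint positions."""
    diff = joints[:, None, :] - obj_pts[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    resid = (dist - target_desc) / (dist + eps)       # (J, K)
    return np.sum(resid[:, :, None] * diff, axis=1)   # (J, 3)

def optimise_goal_pose(ref_obj, ref_pose, new_obj, init_pose, iters=300, step=0.1):
    """Gradient descent with step halving; the energy never increases."""
    target = descriptor(ref_obj, ref_pose)
    pose = init_pose.copy()
    e = energy(new_obj, pose, target)
    for _ in range(iters):
        cand = pose - step * energy_grad(new_obj, pose, target)
        e_cand = energy(new_obj, cand, target)
        if e_cand < e:
            pose, e = cand, e_cand    # accept the improving step
        else:
            step *= 0.5               # otherwise shrink the step and retry
    return pose, e

# Synthetic data (hypothetical): the "unseen" object is a rigidly moved copy
# of the reference object, so a zero-energy goal pose exists.
rng = np.random.default_rng(0)
ref_obj = rng.normal(size=(8, 3))     # 8 object keypoints
ref_pose = rng.normal(size=(5, 3))    # 5 skeleton joints
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([1.0, -0.5, 0.2])
new_obj = ref_obj @ R.T + t

e_init = energy(new_obj, ref_pose, descriptor(ref_obj, ref_pose))
goal_pose, e_final = optimise_goal_pose(ref_obj, ref_pose, new_obj, init_pose=ref_pose)
print(f"energy before: {e_init:.4f}, after: {e_final:.4f}")
```

In this toy setting the search starts from the untransformed reference pose and only accepts steps that lower the energy. The real method differs in the parts that matter most: its descriptors come from a field learned on object-only data, and the optimisation is regularised by pose priors.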

Paper Structure (21 sections, 13 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: ROAM generates sitting and lying poses on a large variety of chairs and couches. It is $\mathsf{SE(3)}$-equivariant and robust to out-of-distribution objects of the same category. ROAM also allows interactive character control without requiring the model to be trained on mocap data with subjects interacting with diverse chairs and couches.
  • Figure 2: Method overview. Given a starting standing pose, a sitting pose on a reference chair, and a novel object (not seen during training), our goal is to synthesise a character animation that approaches the novel object (either chair or sofa) and then sits or lies on it, with motion that is aware of the shape of the novel object. We achieve this with our proposed two-stage procedure. First, we propose a skeletal pose field, which adapts the sitting pose on the reference chair to the novel chair in a shape-aware manner without requiring any manual labelling. We call this pose the goal sitting pose. Second, l-NSM generates an approaching and interaction motion towards the novel object.
  • Figure 3: Goal Pose Synthesis (GPS) module. For the given reference object-pose pair and a target object, we generate neural descriptors $\mathbf{Z}(\mathcal{O}, P)$ and $\mathbf{Z}(\mathcal{O}^{\prime}, P^{\prime})$ that represent the relative geometry of the object and the pose, thereby allowing us to optimise the target pose $P^{\prime}$ (see the Goal Pose Synthesis section).
  • Figure 4: Results of the perceptual study for qualitative comparison of ROAM against the state of the art. Our method is consistently preferred over all other approaches in terms of both semantic coherence with the chair's geometry and motion realism.
  • Figure 5: Qualitative results of the Skeletal Pose Descriptor-based goal pose optimisation (see the Goal Pose Synthesis section). Given a reference pose for a reference chair (top row), our method can adjust the reference pose to an unseen chair (second and third row). We show a variety of poses in the range from sitting on chairs (first column), sitting on sofas (middle column) and lying on sofas (last column). Please watch our supplementary video for animations of the optimisation process.
  • ...and 5 more figures