
Generating Continual Human Motion in Diverse 3D Scenes

Aymen Mir, Xavier Puig, Angjoo Kanazawa, Gerard Pons-Moll

TL;DR

This work tackles continual human motion synthesis in diverse 3D scenes by decoupling scene reasoning from motion generation. It introduces action keypoints and a goal-centric canonical coordinate frame, enabling long-range motion using scene-agnostic mocap data. Two transformers, WalkNet and TransNet, generate walking and in-betweening transitions, trained entirely on mocap data and conditioned through anchor poses derived from keypoints. The approach generalizes across multiple real-world scene datasets and outperforms baselines in realism and scene constraint satisfaction, offering a scalable pathway for animator-guided motion in arbitrary environments.

Abstract

We introduce a method to synthesize animator-guided human motion across 3D scenes. Given a set of sparse (3 or 4) joint locations (such as the location of a person's hand and two feet) and a seed motion sequence in a 3D scene, our method generates a plausible motion sequence starting from the seed motion while satisfying the constraints imposed by the provided keypoints. We decompose the continual motion synthesis problem into walking along paths and transitioning in and out of the actions specified by the keypoints, which enables the generation of long motion sequences that satisfy scene constraints without explicitly incorporating scene information. Our method is trained using only scene-agnostic mocap data. As a result, our approach is deployable across 3D scenes with various geometries. To achieve plausible continual motion synthesis without drift, our key contribution is to generate motion in a goal-centric canonical coordinate frame where the next immediate target is situated at the origin. Our model can generate long sequences of diverse actions such as grabbing, sitting and leaning chained together in arbitrary order, demonstrated on scenes of varying geometry: HPS, Replica, Matterport, ScanNet and scenes represented using NeRFs. Several experiments demonstrate that our method outperforms existing methods that navigate paths in 3D scenes. For more results we urge the reader to watch our supplementary video available at: https://www.youtube.com/watch?v=0wZgsdyCT4A&t=1s
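
To make the goal-centric canonical coordinate frame more concrete, the following is a minimal sketch of one way such a change of frame could be written. It is not the authors' implementation. It assumes a z-up scene, a rigid transform made of a translation plus a rotation about the vertical axis, and a convention in which the path tangent at the goal is aligned with +x. The function names `to_goal_frame` and `from_goal_frame` are hypothetical.

```python
import math


def _yaw_from_tangent(tangent):
    # Heading angle of the path tangent in the ground (x, y) plane.
    # Assumption: z is the up axis, so only x and y define the heading.
    return math.atan2(tangent[1], tangent[0])


def to_goal_frame(points, goal, tangent):
    """Map scene-frame points into a frame where `goal` is the origin
    and `tangent` points along +x (illustrative convention)."""
    yaw = _yaw_from_tangent(tangent)
    c, s = math.cos(-yaw), math.sin(-yaw)
    out = []
    for x, y, z in points:
        # Translate so the goal sits at the origin ...
        dx, dy, dz = x - goal[0], y - goal[1], z - goal[2]
        # ... then rotate about the vertical axis by -yaw.
        out.append((c * dx - s * dy, s * dx + c * dy, dz))
    return out


def from_goal_frame(points, goal, tangent):
    """Inverse of `to_goal_frame`: map canonical points back to the scene."""
    yaw = _yaw_from_tangent(tangent)
    c, s = math.cos(yaw), math.sin(yaw)
    out = []
    for x, y, z in points:
        # Rotate by +yaw, then translate back to the goal location.
        out.append((c * x - s * y + goal[0], s * x + c * y + goal[1], z + goal[2]))
    return out


# Example: a goal at (2, 3, 0) approached along the +y direction.
goal = (2.0, 3.0, 0.0)
tangent = (0.0, 1.0, 0.0)
scene_points = [(2.0, 1.0, 0.9), goal]

canonical = to_goal_frame(scene_points, goal, tangent)
restored = from_goal_frame(canonical, goal, tangent)
```

In this example the goal maps to the origin, and the first point lands about 2 m behind it along the heading axis, regardless of where the goal lies in the scene. Under these assumptions, a model operating in the canonical frame always sees its next target at the origin, and mapping its output back with the inverse transform places the motion in the scene.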

Paper Structure (26 sections, 6 equations, 6 figures, 3 tables)

This paper contains 26 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 2: Overview of our method. We generate human motion satisfying keypoint constraints by dividing it into 3 stages: a Walk Motion, which animates the human as it walks between keypoints; a Transition-In, which blends the walking motion with the pose specified by the keypoints; and a Transition-Out, which animates the human back to the walking pose. We use an autoregressive transformer, WalkNet, to synthesize the walking motion, and a masked-autoencoder transformer, TransNet, to generate the blending motion. By moving the motion into a Goal-Centric Canonical Coordinate Frame, our method can generalize to a wide set of 3D scenes.
  • Figure 3: a) Using keypoints and tangents along a path, we move motion from the scene coordinate frame into b) the goal-centric canonical coordinate frame, where c) WalkNet synthesizes motion that converges at the origin of the coordinate frame. d) Once the synthesized motion reaches the origin, we move it back to the scene coordinate frame.
  • Figure 4: Using language instruction and semantic segmentation, keypoints can be automatically placed in a 3D scene.
  • Figure 5: Using a) the motion-anchor pose in the 3D scene (purple), b) we move the motion sequence into the canonical coordinate frame. c) There, TransNet synthesizes transitions (blue) between the input motion and the pose placed at the origin (purple). d) Once the motion is synthesized, we move it back to the scene coordinate frame.
  • Figure 6: Our method generates motion that generalizes across different scenes. Here we show motion generation in scenes from 4 different datasets: Replica [replica19arxiv], Matterport [niessner2017Matterport3D], HPS [mir20hps], and ScanNet [dai2017scannet].
  • ...and 1 more figures