DEMOS: Dynamic Environment Motion Synthesis in 3D Scenes via Local Spherical-BEV Perception

Jingyu Gong; Min Wang; Wentao Liu; Chen Qian; Zhizhong Zhang; Yuan Xie; Lizhuang Ma

TL;DR

This work proposes DEMOS, the first Dynamic Environment MOtion Synthesis framework, which predicts future motion instantly from the current scene and uses that prediction to dynamically update the latent motion for final motion synthesis.

Abstract

Motion synthesis in real-world 3D scenes has recently attracted much attention. However, most current methods assume a static environment, and this assumption usually fails for real-time motion synthesis in scanned point cloud scenes when multiple dynamic objects are present, e.g., moving persons or vehicles. To handle this problem, we propose the first Dynamic Environment MOtion Synthesis framework (DEMOS), which predicts future motion instantly according to the current scene and uses it to dynamically update the latent motion for final motion synthesis. Concretely, we propose a Spherical-BEV perception method to extract local scene features specifically designed for instant scene-aware motion prediction. We then design a time-variant motion blending scheme to fuse the newly predicted motions into the latent motion, and derive the final motion from the updated latent motions, benefiting from both motion-prior and iterative methods. We unify the data format of two prevailing datasets, PROX and GTA-IM, and use them to evaluate motion synthesis in 3D scenes. We also assess the responsiveness of the proposed method in dynamic environments built from GTA-IM and Semantic3D. The results show that our method significantly outperforms previous works and handles dynamic environments well.
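
The iterative update described above can be illustrated with a toy sketch. Following the Figure 2 caption, the final motion takes the first frame of each updated latent motion, $\tilde{\mathcal{S}}_{0:F}=[\ddot{\mathcal{S}}_{0}[0],\cdots,\ddot{\mathcal{S}}_{F}[0]]$. Everything else here is an assumption made for illustration and is not taken from the paper: the function names (`blend_weights`, `update_latent`, `synthesize`), the linear weight schedule, the padding of the shifted latent motion, and the `predict` callback that stands in for the learned networks.

```python
from typing import Callable, List

Frame = List[float]   # one pose/position vector (toy stand-in for a body state)
Motion = List[Frame]  # a short horizon of future frames


def blend_weights(horizon: int, w_near: float = 0.2, w_far: float = 1.0) -> List[float]:
    """Hypothetical time-variant schedule: near-future frames mostly keep the
    old latent motion (smoothness), far-future frames mostly follow the new
    prediction (responsiveness). The paper's actual weights are not given here."""
    if horizon == 1:
        return [w_far]
    step = (w_far - w_near) / (horizon - 1)
    return [w_near + step * i for i in range(horizon)]


def update_latent(latent: Motion, predicted: Motion, weights: List[float]) -> Motion:
    """Blend a newly predicted motion into the latent motion, frame by frame."""
    return [
        [(1.0 - w) * a + w * b for a, b in zip(old, new)]
        for old, new, w in zip(latent, predicted, weights)
    ]


def synthesize(
    predict: Callable[[Frame, int], Motion],
    start: Frame,
    horizon: int,
    num_steps: int,
) -> Motion:
    """Run the update loop and collect the first frame of every latent motion.

    `predict(frame, t)` is assumed to return `horizon` future frames given the
    current frame and the scene as observed at step t.
    """
    weights = blend_weights(horizon)
    latent = predict(start, 0)   # initial latent motion
    output = [latent[0]]
    for t in range(1, num_steps + 1):
        # Drop the emitted frame; pad by repeating the last one (assumption).
        shifted = latent[1:] + [latent[-1]]
        predicted = predict(output[-1], t)
        latent = update_latent(shifted, predicted, weights)
        output.append(latent[0])
    return output
```

With a predictor whose target direction flips partway through, the emitted frames turn gradually instead of jumping to the new prediction, which is the behavior the blending is meant to provide in this sketch.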

Paper Structure (14 sections, 10 equations, 13 figures, 6 tables)


Introduction
Related Work
Method
Overview
Projection-based Local Scene Perception
Networks
Iterative Latent Motion Update
Experiments
Datasets
Implementation Details
Evaluation Metrics
Experimental Results
Ablation Study
Conclusion

Figures (13)

Figure 1: Illustration of the proposed Dynamic Environment Motion Synthesis (DEMOS) framework based on projection-based Spherical-BEV perception. We estimate the body-centered spherical angular depth (blue spherical coordinate) and horizontal elevation map (gray mesh grid) to provide local geometry hints for instant scene-aware motion synthesis. We can thus iteratively generate a new hypothesized motion (orange curve) and use it to update the latent motion (green to yellow curve) via motion blending, adapting to changes in the scanned scene point clouds.
Figure 2: Framework of the proposed Dynamic Environment Motion Synthesis (DEMOS) pipeline. (a) Human start information and the surrounding scene are taken as inputs for scene-aware motion synthesis. (b) We first sample the goal position and orientation $\{\hat{t}_G,\hat{r}_G\}$ based on current information and the surrounding scene. (c)-(d) We then infer the future motion sequence $\hat{\mathcal{S}}_{T}$, consisting of route $\hat{\mathcal{R}}_{T}$ and pose $\hat{\mathcal{P}}_{T}$, in turn. (e) The newly generated motion sequence $\hat{\mathcal{S}}_{T}$ is then used to update the latent motion $\ddot{\mathcal{S}}_{T-1}$ to obtain the new latent motion $\ddot{\mathcal{S}}_{T}$ in an iterative manner. The final synthesized long-term motion is derived as $\tilde{\mathcal{S}}_{0:F}=[\ddot{\mathcal{S}}_{0}[0],\cdots,\ddot{\mathcal{S}}_{F}[0]]$.
Figure 3: Illustration of states with different body parts as anchors. In our setting, we annotate a pose as idle when both feet are anchors, locomotion when one foot is the anchor, sitting when the gluteus is the anchor, and lying when the back is the anchor.
Figure 4: Illustration of Spherical Angular and BEV Elevation Perception. (a) Surrounding scene points are projected onto a unit sphere to estimate the depth/distance (redder means closer) to the scene point cloud in different angular directions. (b) We take a BEV centered on the human body to capture the surrounding elevation information (red indicates high elevation, green indicates low elevation).
Figure 5: Frameworks of GoalNet and RouteNet. (a) GoalNet adopts a CVAE structure to predict the goal $\{\hat{t}_g, \hat{r}_g\}$ based on current human body information and the scene point cloud. (b) RouteNet predicts the future route $\hat{R}_{1:k}$ given the scanned scene, human start pose, and start-goal position/orientation information.
...and 8 more figures
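
The two projections described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, bin counts, depth cap, cell size, z-up convention, nearest-point pooling for the spherical map, and max-height pooling for the BEV map are all hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def spherical_depth_map(
    points: Sequence[Point],
    center: Point,
    n_azimuth: int = 8,
    n_polar: int = 4,
    max_depth: float = 5.0,
) -> List[List[float]]:
    """Nearest scene-point distance per angular bin around the body center.

    Rows index the polar angle (0 = straight up, z-up assumed); columns index
    the azimuth. Bins with no point within `max_depth` keep `max_depth`.
    """
    depth = [[max_depth] * n_azimuth for _ in range(n_polar)]
    for x, y, z in points:
        dx, dy, dz = x - center[0], y - center[1], z - center[2]
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        if dist == 0.0 or dist >= max_depth:
            continue
        azimuth = math.atan2(dy, dx) % (2.0 * math.pi)        # [0, 2*pi)
        polar = math.acos(max(-1.0, min(1.0, dz / dist)))     # [0, pi]
        a = min(int(azimuth / (2.0 * math.pi) * n_azimuth), n_azimuth - 1)
        p = min(int(polar / math.pi * n_polar), n_polar - 1)
        depth[p][a] = min(depth[p][a], dist)
    return depth


def bev_elevation_map(
    points: Sequence[Point],
    center: Point,
    grid_size: int = 8,
    cell: float = 1.0,
    empty_value: float = 0.0,
) -> List[List[float]]:
    """Highest z per cell of a square bird's-eye-view grid centered on the body.

    Rows index y and columns index x; cells with no points keep `empty_value`.
    """
    half = grid_size * cell / 2.0
    grid = [[empty_value] * grid_size for _ in range(grid_size)]
    for x, y, z in points:
        col = math.floor((x - center[0] + half) / cell)
        row = math.floor((y - center[1] + half) / cell)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row][col] = max(grid[row][col], z)
    return grid
```

Both maps depend only on points near the body, so they can be recomputed at every step from the latest scan. That locality is what makes this kind of perception suitable for scenes whose geometry changes over time.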
