Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Jingyu Gong, Chong Zhang, Fengqi Liu, Ke Fan, Qianyu Zhou, Xin Tan, Zhizhong Zhang, Yuan Xie

TL;DR

This paper disentangles human-scene interaction from motion synthesis during training and then introduces an interaction-based implicit policy into motion diffusion during inference, yielding better motion naturalness and interaction plausibility than cutting-edge methods.

Abstract

Scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, and they generalize poorly to diverse scenes when trained on only a few specific ones. We therefore propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, in which paired motion-scene data are no longer necessary. We disentangle human-scene interaction from motion synthesis during training, and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so motion naturalness and interaction plausibility are maintained simultaneously. For long-term motion synthesis, we introduce motion blending in joint rotation power space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture and on real scenes from PROX and Replica. Results show that our framework achieves better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of using DIP for motion synthesis in more general tasks and versatile scenes. Code will be publicly available at https://github.com/jingyugong/DIP.
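
The abstract does not spell out how blending in "joint rotation power space" is computed. One plausible reading, offered here only as an assumption, is that a joint rotation is blended by raising the relative rotation between the historical pose and the newly synthesized pose to a fractional power w, which is equivalent to spherical interpolation. The sketch below illustrates that reading for a single joint using unit quaternions in (w, x, y, z) order; the function names are hypothetical and are not taken from the DIP code base.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def quat_conj(q):
    """Conjugate, which is the inverse for a unit quaternion."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_pow(q, w):
    """Raise a unit quaternion to a real power w (scales the rotation angle by w)."""
    q = q / np.linalg.norm(q)
    if q[0] < 0:  # pick the representative on the shorter arc
        q = -q
    sin_half = np.linalg.norm(q[1:])
    if sin_half < 1e-8:  # (near-)identity rotation: any power is the identity
        return np.array([1.0, 0.0, 0.0, 0.0])
    half_angle = np.arctan2(sin_half, q[0])
    axis = q[1:] / sin_half
    return np.concatenate([[np.cos(w * half_angle)], np.sin(w * half_angle) * axis])

def blend_rotation(q_hist, q_new, w):
    """Assumed blending rule: q_hist * (q_hist^-1 * q_new)^w, with w in [0, 1]."""
    relative = quat_mul(quat_conj(q_hist), q_new)
    return quat_mul(q_hist, quat_pow(relative, w))
```

Under this assumption, ramping w from 0 to 1 across the overlapping frames moves each joint from the historical rotation to the new one along the shortest arc, which avoids the non-rotation artifacts of linearly averaging rotation parameters.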

Paper Structure

This paper contains 30 sections, 21 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Policy learning frameworks. (a) Explicit policy is trained with paired motion-scene data. (b) Implicit policy optimizes the motion from random initialization. (c) Diffusion policy gradually denoises the motion. (d) Our diffusion implicit policy iteratively denoises and optimizes the motion to ensure motion naturalness, diversity, and interaction plausibility without the need for any paired motion-scene data.
  • Figure 2: (a) shows the overall pipeline. Any feasible command is decomposed into sub-tasks with action-object pairs. Then, future motion is synthesized according to the current sub-task. Finally, the synthesized motions are fused into the historical motion to obtain the final long-term motion. (b) presents the Diffusion Implicit Policy (DIP). In each iteration, the denoising step makes the synthesized motion appear more natural, and the implicit policy optimization endows the motion with plausible interaction. The random sampling step helps the framework synthesize motion with diverse styles.
  • Figure 3: Illustration of the conditional diffusion model. A diffusion model is first trained conditioned on action, and then a ControlNet branch is added to provide hints from keyframe joints.
  • Figure 4: Visual results of DIMOS and our method for the locomotion task. The dashed circles indicate lower penetration, less skating and higher diversity in the synthesized motion.
  • Figure 5: Visual results of synthesized motions given by DIMOS and our method for sitting and lying. The dashed circles indicate obvious advantages over DIMOS: less collision (col. 1,3), higher diversity (col. 2) and better foot contact (col. 4).
  • ...and 5 more figures
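
The loop described in the caption of Figure 2(b), which alternates a denoising step, an implicit policy optimization step and a random sampling step, can be sketched in a toy setting as follows. This is only an illustration under strong assumptions, not the authors' implementation: the trained motion diffusion model is replaced by a simple temporal smoothing function, the interaction objective is a hypothetical quadratic cost that pulls the last frame toward a goal position, and the noise schedule is a made-up linear ramp. All function names and parameters are invented for this sketch.

```python
import numpy as np

def toy_denoise(x, t, num_steps):
    """Stand-in for a trained denoiser: smooth interior frames over time.

    t and num_steps are unused here; a real denoiser would be conditioned on them.
    """
    smoothed = x.copy()
    smoothed[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return smoothed

def interaction_cost_grad(x, goal):
    """Gradient of the toy cost 0.5 * ||x[-1] - goal||^2 with respect to x."""
    grad = np.zeros_like(x)
    grad[-1] = x[-1] - goal
    return grad

def dip_sample(goal, num_frames=16, dim=3, num_steps=50,
               opt_steps=5, lr=0.5, seed=0):
    """Alternate denoising, interaction optimization and random sampling."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, dim))  # start from pure noise
    for t in reversed(range(num_steps)):
        # 1) Denoising step: pull the motion toward the (toy) natural-motion prior.
        x = toy_denoise(x, t, num_steps)
        # 2) Implicit policy optimization: gradient descent on the interaction cost.
        for _ in range(opt_steps):
            x = x - lr * interaction_cost_grad(x, goal)
        # 3) Random sampling: re-inject noise, except at the final step.
        if t > 0:
            sigma = 0.1 * t / num_steps  # assumed linear noise schedule
            x = x + sigma * rng.standard_normal(x.shape)
    return x

goal = np.array([1.0, 0.0, 2.0])
motion = dip_sample(goal)
```

In this toy version, the interaction term never enters training of the "denoiser"; it only acts at sampling time, which mirrors the unpaired setting described in the abstract. Changing `seed` yields different trajectories that all end near `goal`, a small-scale analogue of the diversity attributed to the random sampling step.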