Object-Aware 4D Human Motion Generation

Shurui Gui, Deep Anil Patel, Xiner Li, Martin Renqiang Min

TL;DR

This work tackles object-aware 4D human motion generation by infusing explicit 3D priors and diffusion-based motion knowledge into a zero-shot framework. It introduces Motion Score Distilled Interaction (MSDI), which uses 3D Gaussian representations for humans and objects, LLM-driven spatial guidance for coarse trajectories, and Motion Diffusion Score Distillation Sampling (MSDS) to refine motions under explicit trajectory, smoothness, and collision constraints. The approach demonstrates superior motion realism, diversity, and physical plausibility compared with 4Dfy across quantitative metrics and user studies, while remaining generalizable to unseen object interactions without retraining. The combination of 3D priors, diffusion priors, and language-guided planning offers a scalable path toward realistic, interactive 4D content creation with broad practical impact for synthetic data, animation, and VR/AR applications.

Abstract

Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object-aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
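
To make the idea of distilling score gradients under trajectory, smoothness, and collision constraints more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions and not the authors' implementation: the function names, the sphere used as an object proxy, the simple `x + sigma * eps` noising, and the toy denoiser (a point-mass "prior" at a known target motion) are all hypothetical stand-ins for a pre-trained motion diffusion model and real scene geometry.

```python
import numpy as np


def trajectory_loss(root_xyz, waypoints):
    """Mean squared distance between root positions and target waypoints, both (T, 3)."""
    return float(np.mean(np.sum((root_xyz - waypoints) ** 2, axis=1)))


def smoothness_loss(root_xyz):
    """Mean squared second finite difference (discrete acceleration) over time."""
    accel = root_xyz[2:] - 2.0 * root_xyz[1:-1] + root_xyz[:-2]
    return float(np.mean(np.sum(accel ** 2, axis=1)))


def collision_loss(root_xyz, center, radius):
    """Mean squared penetration depth into a sphere standing in for the object (assumption)."""
    dist = np.linalg.norm(root_xyz - center, axis=1)
    penetration = np.clip(radius - dist, 0.0, None)
    return float(np.mean(penetration ** 2))


def sds_gradient(x, denoiser, sigma, rng, weight=1.0):
    """SDS-style gradient weight * (eps_pred - eps) for the noised sample x + sigma * eps."""
    eps = rng.standard_normal(x.shape)
    eps_pred = denoiser(x + sigma * eps, sigma)
    return weight * (eps_pred - eps)


def make_toy_denoiser(target):
    """Exact noise predictor when the 'motion prior' is a point mass at `target` (toy only)."""
    return lambda x_noisy, sigma: (x_noisy - target) / sigma


# Toy usage: pull an all-zero root trajectory toward a straight-line target motion.
rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 8)[:, None] * np.array([1.0, 0.0, 0.0])
denoiser = make_toy_denoiser(target)

motion = np.zeros((8, 3))
for _ in range(200):
    # Gradient descent on the distilled score; constraint gradients would be added here.
    motion -= 0.1 * sds_gradient(motion, denoiser, sigma=0.5, rng=rng)
```

In this toy setting the sampled noise cancels, so each step moves the motion a fixed fraction of the way toward the target. With a real diffusion prior the gradient would be stochastic, and the three loss terms would typically be differentiated with an autodiff framework and added to the update with tuned weights.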

Paper Structure

This paper contains 25 sections, 12 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Method overview. The framework includes 4 components: human and object 3D generation, coarse trajectory generation, constrained motion optimization, and rendering.
  • Figure 2: Qualitative Results. Generated videos from 4Dfy and MSDI across various text prompts. Each row corresponds to a different prompt. Within each row, columns display frames sampled at incremental timesteps from the generated video, illustrating temporal progression and motion characteristics. The frames are center cropped for better visibility.
  • Figure 3: Quantitative Results. Quantitative comparison of MSDI and 4Dfy. The bar chart displays scores for 4 key metrics across 10 text prompts. An 'X' marker indicates that the metric failed to detect any humans in all four generated views for that particular prompt.
  • Figure 4: Ablation study on key components of MSDI. We visualize the impact of removing our main loss terms for the prompt "the human jumps onto the table".
  • Figure 5: Video Language Score Comparison of MSDI and 4Dfy.
  • ...and 5 more figures