Diffusion-based Generation, Optimization, and Planning in 3D Scenes

Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu

TL;DR

SceneDiffuser is a diffusion-based conditional generative model that unifies generation, optimization, and planning in 3D scenes. It targets two weaknesses of earlier scene-conditioned models: posterior collapse, and the mismatch between separately built generation and planning modules. By folding differentiable physics objectives into the iterative diffusion sampling through gradient-guided sampling, it produces physically plausible, scene-conditioned results and supports long-horizon planning. The method covers human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms, and reports strong improvements over CVAE baselines and separate planners. The result is a single differentiable, scene-aware framework for embodied perception and manipulation in complex 3D environments.

Abstract

We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding.
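The abstract describes folding an objective into the iterative denoising process. A minimal sketch of that general idea follows, in the style of classifier guidance; it is not the authors' implementation. Everything named here is an assumption made for illustration: the linear noise schedule, the closed-form `toy_eps_model` (a hypothetical stand-in for a trained, scene-conditioned noise predictor, exact only for Gaussian toy data), the quadratic goal objective behind `objective_grad`, and the `guided_sample` helper itself.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice for this sketch)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(x, alpha_bar, data_mean, data_std):
    """Hypothetical stand-in for a trained, scene-conditioned noise predictor.

    For data drawn from N(data_mean, data_std^2 I), the noised marginal is
    N(sqrt(alpha_bar) * data_mean, alpha_bar * data_std^2 + 1 - alpha_bar),
    so the optimal noise prediction has this closed form.
    """
    var = alpha_bar * data_std**2 + (1.0 - alpha_bar)
    return np.sqrt(1.0 - alpha_bar) * (x - np.sqrt(alpha_bar) * data_mean) / var


def objective_grad(x, goal):
    """Gradient of the toy objective phi(x) = -0.5 * ||x - goal||^2."""
    return goal - x


def guided_sample(n, goal, guidance_scale, data_mean=0.0, data_std=1.0,
                  num_steps=200, seed=0):
    """Draw n samples with the reverse-step mean shifted along grad(phi)."""
    goal = np.asarray(goal, dtype=float)
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal((n, goal.shape[0]))  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps = toy_eps_model(x, alpha_bars[t], data_mean, data_std)
        # Standard DDPM posterior mean computed from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Guidance: shift the mean by scale * variance * gradient of the objective.
        mean = mean + guidance_scale * betas[t] * objective_grad(mean, goal)
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x
```

Calling `guided_sample(500, [3.0, 0.0], guidance_scale=0.0)` samples from the unguided toy model, while a positive `guidance_scale` pulls the samples toward the goal. In a real system the quadratic objective would be replaced by differentiable physics or planning terms, which this sketch does not attempt to model.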

Paper Structure

This paper contains 72 sections, 21 equations, 13 figures, 10 tables, 2 algorithms.

Figures (13)

  • Figure 1: Illustration of the SceneDiffuser, applicable to various scene-conditioned 3D tasks: (a) human pose generation, (b) human motion generation, (c) dexterous grasp generation, (d) path planning for 3D navigation with goals, and (e) motion planning for robot arms.
  • Figure 2: Model architecture of the SceneDiffuser. We use cross-attention to learn the relation between the input trajectory and scene condition. The optimizer and planner serve as the guidance for physically plausible and goal-oriented trajectories.
  • Figure 3: Qualitative results of human pose generation in 3D scenes. From left to right: (a) CVAE generation, (b) SceneDiffuser generation without optimization, and poses generated (c) with and (d) without applying our optimization-guided sampling.
  • Figure 4: Human motions generated by SceneDiffuser. Each row shows sampled human motions from the same start pose.
  • Figure 5: Qualitative results of dexterous grasp generation. Compared to grasps generated by CVAE (first row), SceneDiffuser (second row) generates fewer colliding or floating poses, which helps to achieve a higher success rate.
  • ...and 8 more figures