Diffusion Models in 3D Vision: A Survey

Zhen Wang, Dongyuan Li, Yaozu Wu, Tianyu He, Jiang Bian, Renhe Jiang

TL;DR

This survey reviews the integration of diffusion models into 3D vision, covering forward noising, reverse denoising, and score-based perspectives for modeling complex 3D data distributions. It details how denoising diffusion probabilistic models (DDPMs), score-based generative models (SGMs), and stochastic differential equations (SDEs) are adapted to 3D representations for tasks including unconditional generation, image-to-3D, text-to-3D, texture, avatars, scenes, editing, novel view synthesis, and depth estimation. It also catalogs widely used 3D datasets and evaluation metrics, and identifies three limitations (computational demands, multimodal fusion, and data scarcity) while proposing directions such as efficient inference, cross-modal integration, large-scale pretraining, and physics-informed constraints. The paper aims to guide researchers and engineers by providing a structured understanding of diffusion-based 3D generation and a roadmap for future improvements with real-world impact in autonomous driving, robotics, AR/VR, and healthcare.

Abstract

In recent years, 3D vision has become a crucial field within computer vision, powering a wide range of applications such as autonomous driving, robotics, augmented reality, and medical imaging. This field relies on accurate perception, understanding, and reconstruction of 3D scenes from 2D images or text data sources. Diffusion models, originally designed for 2D generative tasks, offer the potential for more flexible, probabilistic methods that can better capture the variability and uncertainty present in real-world 3D data. In this paper, we review the state-of-the-art methods that use diffusion models for 3D visual tasks, including but not limited to 3D object generation, shape completion, point-cloud reconstruction, and scene construction. We provide an in-depth discussion of the underlying mathematical principles of diffusion models, outlining their forward and reverse processes, as well as the various architectural advancements that enable these models to work with 3D datasets. We also discuss the key challenges in applying diffusion models to 3D vision, such as handling occlusions and varying point densities, and the computational demands of high-dimensional data. Finally, we discuss potential solutions, including improving computational efficiency, enhancing multimodal fusion, and exploring the use of large-scale pretraining for better generalization across 3D tasks. This paper serves as a foundation for future exploration and development in this rapidly evolving field.

Paper Structure

This paper contains 36 sections, 10 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: The overall framework of this survey.
  • Figure 2: Diffusion models smoothly perturb data by adding noise, then reverse this process to generate new data from noise. Each denoising step in the reverse process typically requires estimating the score function [yang2023diffusion].
  • Figure 3: The directed graphical model considered in denoising diffusion probabilistic models [ddpm].
  • Figure 4: Solving a reverse-time stochastic differential equation yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with a continuous-time SDE [score_sde].
  • Figure 5: Taxonomy of 3D Diffusion Tasks.
  • ...and 9 more figures