Digital Twin Generation from Visual Data: A Survey

Andrew Melnik, Benjamin Alt, Giang Nguyen, Artur Wilkowski, Maciej Stefańczyk, Qirui Wu, Sinan Harms, Helge Rhodin, Manolis Savva, Michael Beetz

TL;DR

This survey analyzes how to generate immersive indoor Digital Twins from visual data, covering geometry representations (Mesh, CAD, 3DGS) and the end-to-end pipelines from video to 3D reconstructions. It surveys traditional 2D-to-3D workflows (COLMAP SfM, SLAM) and modern 3DGS-based methods, including single-image and sparse-view strategies, with emphasis on handling input variety (monocular, non-calibrated, fisheye, LiDAR/RGB-D) and diffusion-guided domain-free scene generation. A core thread is the temporal dimension, comparing implicit neural, voxel-based, and planar-factorized approaches to model dynamic scenes, alongside regularization and efficiency considerations for robotics applications. The paper also covers lighting, reflections, articulated objects, and physics integration, highlighting scene description formats (URDF, USD) and simulators (MuJoCo, PyBullet, Gazebo, Unreal) for physics-based DTs. Collectively, it identifies practical pathways and research directions for scalable, physically plausible, and semantically rich digital twins driven by visual data, with future opportunities in per-room lighting, diffusion-guided 3D generation, and integrated physics simulation.

Abstract

This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction work. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: https://github.com/ndrwmlnk/awesome-digital-twins

Paper Structure

This paper contains 43 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Generating Indoor Digital Twins from Visual Data.
  • Figure 2: Visualization of key principles in the 3D Gaussian Splatting pipeline (a). Gaussians are initialized from a sparse point cloud (b). For fast rendering, the image is split into tiles using differentiable rasterization (c). Projected Gaussians inside a tile's view frustum are sorted by depth (d), which allows $\alpha$-blending to determine the final color of each pixel (e). During optimization, adaptive densification controls the number of Gaussians to minimize reconstruction errors (f). View-dependency of color can lead to inconsistency when rendering from different views; flattening the z-scale can improve consistency (g). Figure compiled from (a) Kerbl2023; (b) Rahman2023; (c), (e) Han2025; (d) Yurkova2023; (f) Kerbl2023; (g) Huang2024a.
  • Figure 3: Generation of mesh models. Image from chen2024meshanything
  • Figure 4: Summary of the 3D shape retrieval paradigm.
  • Figure 5: Example rendering pipeline for Gaussian splatting with reflections ye20243DGaussianSplatting
  • ...and 4 more figures
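
The depth sorting and $\alpha$-blending steps named in the Figure 2 caption can be illustrated with a minimal sketch for a single pixel. This is not the survey's or any library's implementation: `Splat`, `composite_pixel`, and `min_transmittance` are hypothetical names introduced here, and each splat's opacity at the pixel is assumed to be given already, whereas a real 3DGS renderer derives it from the projected 2D Gaussian and runs per tile on the GPU.

```python
from dataclasses import dataclass


@dataclass
class Splat:
    """One projected Gaussian as seen by a single pixel (hypothetical type)."""
    depth: float   # distance from the camera
    color: tuple   # (r, g, b), each in [0, 1]
    alpha: float   # opacity contribution at this pixel, in [0, 1]


def composite_pixel(splats, min_transmittance=1e-4):
    """Front-to-back alpha blending of the splats covering one pixel.

    Returns the blended (r, g, b) color and the remaining transmittance.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    # Sort nearest-first so closer splats occlude farther ones.
    for splat in sorted(splats, key=lambda s: s.depth):
        weight = splat.alpha * transmittance
        for channel in range(3):
            pixel[channel] += weight * splat.color[channel]
        transmittance *= 1.0 - splat.alpha
        # Assumed early exit: stop once the pixel is effectively opaque.
        if transmittance < min_transmittance:
            break
    return tuple(pixel), transmittance


if __name__ == "__main__":
    splats = [
        Splat(depth=2.0, color=(0.0, 0.0, 1.0), alpha=1.0),  # opaque blue, far
        Splat(depth=1.0, color=(1.0, 0.0, 0.0), alpha=0.5),  # half-transparent red, near
    ]
    print(composite_pixel(splats))
```

In the usage example, the half-transparent red splat is blended first because it is nearer, and the opaque blue splat behind it fills the remaining half of the pixel's color.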