Enhancing Monocular 3D Scene Completion with Diffusion Model

Changlin Song, Jiaqi Wang, Liyun Zhu, He Weng

TL;DR

FlashDreamer addresses monocular 3D scene reconstruction by completing a single image into a full 3D scene. It combines 3D Gaussian Splatting (3DGS) from a pre-trained Flash3D model with diffusion-based inpainting, guided by a prompt generated by a Vision-Language Model, to synthesize multi-view images and merge them into a coherent 3D representation without additional training. The method iteratively renders new viewpoints at predefined angles, inpaints the unseen regions, and merges the results using alignment masks and a loss that enforces consistency with the intermediate 3DGS; $T_i$, $\mathcal{R}_i$, and $M$ denote the transforms, 3DGS representations, and masks, respectively. Evaluations on a Replica subset show improved FID and CLIP scores over PixelSynth and robust, view-consistent completion across various rotation angles, highlighting the practical potential of monocular reconstruction for VR, robotics, and autonomous driving.

Abstract

3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving, enabling machines to understand and interact with complex environments. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance, but this dependence limits their use in scenarios where only a single image is available. In this work, we introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image, significantly reducing the need for multi-view inputs. Our approach leverages a pre-trained vision-language model to generate descriptive prompts for the scene, guiding a diffusion model to produce images from various perspectives, which are then fused to form a cohesive 3D reconstruction. Extensive experiments show that our method effectively and robustly expands single-image inputs into a comprehensive 3D scene, extending monocular 3D reconstruction capabilities without further training. Our code is available at https://github.com/CharlieSong1999/FlashDreamer/tree/main.

Paper Structure

This paper contains 6 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Motivation of our FlashDreamer: Given a single input image, our method reconstructs a more complete 3D scene without requiring additional images from multiple viewpoints.
  • Figure 2: The pipeline of our FlashDreamer. Our model receives the initial image ($I_0$) as input. A pre-trained Vision-Language Model (VLM) takes $I_0$ and generates a text prompt ($t_0$) for diffusion inference. Flash3D generates the initial 3DGS ($\mathcal{R}_0$), which is rendered with the first pre-defined geometric transform $T_1$ and viewing angle $g_1$ to obtain the incomplete image ($I'_1 = T_1(\mathcal{R}_0, g_1)$). The diffusion model completes $I'_1$ by inpainting, conditioned on the prompt $t_0$, to obtain $I_1 = \mathrm{Diffusion}(I'_1, t_0)$. From $I_1$, Flash3D generates a new 3DGS $\mathcal{R}'_1$, which is merged with $\mathcal{R}_0$ to obtain the completed 3DGS ($\mathcal{R}_1$). This sequential loop iterates $N$ times over all the pre-defined geometric transforms $\{T_i\}_{i=1}^{N}$. The final 3DGS ($\mathcal{R}_N$) is the output.
  • Figure 3: The pipeline of Flash3D szymanowicz2024flash3d. Flash3D uses a ResNet encoder to extract features from both the RGB image and its depth map, which is estimated with a pre-trained monocular depth estimation model. These features are then processed by two decoders, which together output all 3D Gaussian parameters. Image reproduced from szymanowicz2024flash3d.
  • Figure 4: Comparison of different rotation angle increments. The top row presents original images rendered from various perspectives. The middle row demonstrates image rotations in 10° increments, spanning from -30° to 30°. The bottom row further refines this with 5° rotation increments over the same range. While smaller rotation increments provide finer adjustments, they result in overlapping edges, which may degrade inpainting quality by introducing artifacts in the boundary regions.
  • Figure 5: Comparison of diffusion models. Images are generated with identical prompts using Stable Diffusion-v2 Rombach_2022_CVPR and Stable Diffusion-xl podell2023sdxl. Stable Diffusion-v2 yields more realistic outputs, while Stable Diffusion-xl shows inconsistencies and artifacts.
  • ...and 3 more figures
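
The sequential loop described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`flashdreamer_loop`, `rotation_schedule`) and the five callables (`describe`, `lift`, `render`, `inpaint`, `merge`) are hypothetical stand-ins for the VLM, Flash3D, the renderer, the diffusion inpainting model, and the 3DGS merge step. The transform $T_i$ and viewing angle $g_i$ are collapsed into a single angle argument for brevity, and the order in which views are visited is an assumption, since the captions give only the range and the increment.

```python
from typing import Any, Callable, List, Sequence


def rotation_schedule(max_deg: int, step_deg: int) -> List[int]:
    """Angles from -max_deg to +max_deg in step_deg increments.

    Skipping 0 (the input view) is an assumption of this sketch.
    """
    return [a for a in range(-max_deg, max_deg + 1, step_deg) if a != 0]


def flashdreamer_loop(
    image0: Any,
    angles: Sequence[float],
    describe: Callable[[Any], str],        # VLM stand-in: I_0 -> t_0
    lift: Callable[[Any], Any],            # Flash3D stand-in: image -> 3DGS
    render: Callable[[Any, float], Any],   # (R_{i-1}, angle) -> incomplete I'_i
    inpaint: Callable[[Any, str], Any],    # (I'_i, t_0) -> completed I_i
    merge: Callable[[Any, Any], Any],      # (R_{i-1}, R'_i) -> R_i
) -> Any:
    """Iteratively complete a scene; no component is trained here."""
    prompt = describe(image0)              # t_0, generated once from I_0
    scene = lift(image0)                   # R_0
    for angle in angles:
        partial = render(scene, angle)         # I'_i, has unseen regions
        completed = inpaint(partial, prompt)   # I_i
        new_scene = lift(completed)            # R'_i
        scene = merge(scene, new_scene)        # R_i
    return scene                           # R_N


if __name__ == "__main__":
    # Toy stubs: a "scene" is just the set of images merged so far.
    result = flashdreamer_loop(
        "I0",
        rotation_schedule(30, 10),
        describe=lambda im: "a room",
        lift=lambda im: {im},
        render=lambda sc, a: f"view@{a}",
        inpaint=lambda im, p: f"{im}|{p}",
        merge=lambda a, b: a | b,
    )
    print(len(result), "images merged into the toy scene")
```

Passing the components in as callables keeps the loop independent of any particular model. Swapping the diffusion model, as compared in Figure 5, or changing the rotation increment, as compared in Figure 4, then only changes the arguments.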