GenRC: Generative 3D Room Completion from Sparse Image Collections

Ming-Feng Li, Yueh-Feng Ku, Hong-Xuan Yen, Chi Liu, Yu-Lun Liu, Albert Y. C. Chen, Cheng-Hao Kuo, Min Sun

TL;DR

GenRC tackles room-scale 3D scene completion from sparse RGBD images without any training. It generates a cross-view consistent RGBD panorama with an equirectangular diffusion process, guided by textual inversion to preserve the style of the input. It introduces E-Diffusion for cross-view panorama inpainting, a textual inversion-based prompt mechanism, and an active sampling strategy that picks panoramas aligned with the input geometry. The method is evaluated on ScanNet and ARKitScenes, where GenRC outperforms baselines on most appearance and geometric metrics, especially with sparse inputs and in cross-domain settings. The results show that high-fidelity, texture-coherent indoor scenes can be produced without dataset-specific training or predefined camera trajectories.

Abstract

Sparse RGBD scene completion is a challenging task, especially when textures and geometry must stay consistent throughout the entire scene. Unlike existing solutions that rely on human-designed text prompts or predefined camera trajectories, we propose GenRC, an automated, training-free pipeline that completes a room-scale 3D mesh with high-fidelity textures. To achieve this, we first project the sparse RGBD images to a highly incomplete 3D mesh. Instead of iteratively generating novel views to fill in the void, we use our proposed E-Diffusion to generate a view-consistent panoramic RGBD image, which ensures global geometry and appearance consistency. Furthermore, we maintain stylistic consistency between the input and output scenes through textual inversion, which replaces human-designed text prompts. To bridge the domain gap among datasets, E-Diffusion leverages models trained on large-scale datasets to generate diverse appearances. GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on the ScanNet and ARKitScenes datasets, even though GenRC is neither trained on these datasets nor given predefined camera trajectories. Project page: https://minfenli.github.io/GenRC

Paper Structure (37 sections, 4 equations, 12 figures, 7 tables)


Figures (12)

  • Figure 1: Scene-level 3D mesh generation. GenRC (the blue path) directly generates a cross-view consistent panorama to complete the main portion of a scene, unlike the iterative methods (the green path) demonstrated in lei2023rgbd2, hollein2023text2room, which require designed camera trajectories. GenRC can produce a comprehensive room-scale mesh with high-fidelity texture, even when provided with sparse RGBD observations. Compared with the previous method RGBD2 lei2023rgbd2, GenRC excels in generating more complete meshes and high-fidelity images.
  • Figure 2: Pipeline of GenRC: (a) First, we extract text embeddings as a token that represents the style of the provided RGBD images via textual inversion. Next, we project these images to a 3D mesh. (b) We then render a panorama from a plausible room center and use equirectangular projection to render various viewpoints of the scene from the panoramic image. Our proposed E-Diffusion, which respects equirectangular geometry, denoises these images concurrently, and their depth is determined via monocular depth estimation, resulting in a cross-view consistent panoramic RGBD image. (c) Lastly, we sample novel views from the mesh to fill in the remaining holes.
  • Figure 3: Multi-view diffusion with equirectangular geometry. (a) Given an incomplete panoramic image, we first obtain several incomplete perspective images via equirectangular projection. (b) To denoise the perspective image at the $i$-th view for one step, we first denoise all images to clean images and warp all of them to the $i$-th view to get an averaged image. Then, we add random noise back to the averaged image to get a perspective image that has been denoised for one step. Note that while we use images in RGB space here for illustration, the entire process operates in latent space.
  • Figure 4: Comparison of methods for panorama generation. We crop two regions on each panorama and project them to perspective views (the red blocks above). (a) MultiDiffusion bar2023multidiffusion can produce a high-resolution image. However, it does not satisfy the geometry of equirectangular projection (e.g., the straight lines on the ceiling in the panorama turn into unrealistic curves in the perspective view). (b) Our proposed E-Diffusion (see the multi-view diffusion section) can generate a panorama that preserves the equirectangular geometry, but without Texture Refinement (TR) the result looks blurry. (c) Using the last 20 denoising steps for Texture Refinement (TR), our approach generates a high-fidelity, high-resolution panorama that adheres to equirectangular geometry.
  • Figure 5: Active Sampling. Given an initialized mesh from two input views as shown in (a), we try to complete the mesh by inpainting the rendered panorama at the room center (yellow and green dots in (b) and (c)). However, input camera views are sometimes blocked by the mesh inpainted from an unreasonable panorama, as shown in (b). To address this issue, our active sampling strategy samples multiple panoramas as candidates and calculates their mean squared errors of depth (the active-sampling MSE equation) with all input depth maps to pick the best panorama. This strategy prevents us from using bad panoramas that occlude the given camera views.
  • ...and 7 more figures
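
The active sampling step described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that, for every candidate panorama, depth has already been rendered at each input camera view from the mesh completed with that candidate, and that a depth of 0.0 marks a pixel without a measurement. The names `depth_mse` and `pick_best_panorama`, the flattened-list depth format, and the per-view averaging are all hypothetical choices made for illustration; the exact form of the paper's MSE equation is not reproduced here.

```python
from typing import Sequence

# Assumption: a depth map is a flattened sequence of floats,
# and 0.0 marks a pixel with no valid input measurement.
Depth = Sequence[float]


def depth_mse(rendered: Depth, observed: Depth) -> float:
    """Mean squared depth error over pixels where the input depth is valid."""
    if len(rendered) != len(observed):
        raise ValueError("depth maps must have the same number of pixels")
    sq_errors = [(r - o) ** 2 for r, o in zip(rendered, observed) if o > 0.0]
    if not sq_errors:
        raise ValueError("observed depth map has no valid pixels")
    return sum(sq_errors) / len(sq_errors)


def pick_best_panorama(
    candidates: Sequence[Sequence[Depth]], inputs: Sequence[Depth]
) -> int:
    """Return the index of the candidate with the lowest mean depth MSE.

    candidates[k][v] is the depth rendered at input view v from the mesh
    completed with candidate panorama k (rendering is assumed done elsewhere).
    """
    def score(per_view: Sequence[Depth]) -> float:
        if len(per_view) != len(inputs):
            raise ValueError("each candidate needs one rendering per input view")
        return sum(depth_mse(r, o) for r, o in zip(per_view, inputs)) / len(inputs)

    scores = [score(c) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy example with two input views of four pixels each.
inputs = [[2.0, 2.0, 0.0, 3.0], [1.5, 0.0, 1.5, 1.5]]
candidates = [
    # Candidate 0: inpainted geometry sits in front of view 0 (depth too small).
    [[0.5, 0.5, 9.0, 3.0], [1.5, 1.0, 1.5, 1.5]],
    # Candidate 1: rendered depth stays close to the input depth.
    [[2.0, 2.1, 9.0, 3.0], [1.5, 1.0, 1.4, 1.5]],
]
best = pick_best_panorama(candidates, inputs)
```

In this toy case the first candidate places geometry in front of an input camera, so its rendered depth is much smaller than the observed depth and its error is large. The second candidate is selected, which mirrors the occlusion failure shown in panel (b) of the figure.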