Integrating Meshes and 3D Gaussians for Indoor Scene Reconstruction with SAM Mask Guidance

Jiyeop Kim, Jongwoo Lim

TL;DR

The paper tackles indoor scene reconstruction by marrying room-layout meshes with 3D Gaussians for objects, addressing joint-training ambiguity via Segment Anything Model (SAM) masks to assign each instance to a single primitive. It introduces a SAM-based mask loss and an additional densification stage to stabilize training and improve rendering quality, while enabling easy room-layout editing by decoupling primitives. Evaluations on the Replica dataset show improved instance separation (GIoU/LIoU) and competitive image quality, with demonstrable editing capabilities for room layouts. By combining explicit meshes for layout with fast Gaussians for objects, the approach offers a flexible, editable, and efficient framework for indoor scene reconstruction and potential extensions to larger or outdoor scenes.

Abstract

We present a novel approach for 3D indoor scene reconstruction that combines 3D Gaussian Splatting (3DGS) with mesh representations. We use meshes for the room layout of the indoor scene, such as walls, ceilings, and floors, while employing 3D Gaussians for other objects. This hybrid approach leverages the strengths of both representations, offering enhanced flexibility and ease of editing. However, joint training of meshes and 3D Gaussians is challenging because it is not clear which primitive should affect which part of the rendered image. Objects close to the room layout often struggle during training, particularly when the room layout is textureless, which can lead to incorrect optimizations and unnecessary 3D Gaussians. To overcome these challenges, we employ Segment Anything Model (SAM) to guide the selection of primitives. The SAM mask loss enforces each instance to be represented by either Gaussians or meshes, ensuring clear separation and stable training. Furthermore, we introduce an additional densification stage without resetting the opacity after the standard densification. This stage mitigates the degradation of image quality caused by a limited number of 3D Gaussians after the standard densification.

Paper Structure (17 sections, 9 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: We propose a novel approach that integrates meshes into 3D Gaussian Splatting [kerbl20233d] to represent the room layout of indoor scenes. Naively training 3D Gaussians and room layout meshes jointly can cause ambiguity, leading to a single instance being represented by both primitives. We address this issue by using SAM masks as a guide. Since the scene is represented as the sum of two primitives, we can selectively render or edit one of the primitives.
  • Figure 2: Overview. Room layout meshes and 3D Gaussians are rendered using a mesh renderer and a Gaussian renderer, respectively. The outputs of the two renderers are blended to obtain the final rendered image $\hat{I}$, which is compared with the GT image $I$ to calculate the color loss $\mathcal{L}_\textrm{color}$ (\ref{eq:color-loss}). Using the opacity map $O$ obtained from the Gaussian renderer and the SAM masks $\{m_i\}_{i=1}^M$ of the GT image $I$, the average opacity of each mask, $\{o_i\}_{i=1}^M$, is obtained using \ref{eq:opacity}. These averages are then used to calculate $\mathcal{L}_\textrm{mask}$ as in \ref{eq:mask-loss2}.
  • Figure 3: Results of rendering only 3D Gaussians after naive joint training. (a) Joint training encounters optimization difficulties for the room layout and surrounding objects due to the ambiguity problem. (b) Nevertheless, the average opacity for each SAM mask serves as a suitable criterion for determining which primitive should represent each instance. The number in the bottom right corner is the average opacity for each mask.
  • Figure 4: The weights applied to each mask under different weighting schemes. Applying a constant weight amplifies the influence of smaller instance masks, while using area-proportional weights amplifies larger masks more than smaller ones.
  • Figure 5: Qualitative results on several indoor scenes of the Replica dataset [straub2019replica]. The images in the second column are rendered only by the room layout meshes, and those in the third column are rendered only by the 3D Gaussians. The last column shows the combined images.
  • ...and 5 more figures
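
The per-mask average opacity in the Figure 2 caption and the two weighting schemes in the Figure 4 caption can be illustrated with a small sketch. The equations themselves are not reproduced on this page, so everything below is an assumption: the average opacity is taken as the plain mean of the opacity map over the pixels of a mask, the area-proportional weight as each mask's share of the total masked area, and the function names and nested-list image layout are hypothetical. It is not the authors' implementation.

```python
from typing import List

# Hypothetical layouts: H x W nested lists, not the paper's data structures.
OpacityMap = List[List[float]]  # accumulated Gaussian opacity per pixel, in [0, 1]
Mask = List[List[bool]]         # binary SAM mask for one instance


def mask_area(mask: Mask) -> int:
    """Number of pixels covered by the mask."""
    return sum(1 for row in mask for inside in row if inside)


def average_opacity(opacity_map: OpacityMap, mask: Mask) -> float:
    """Mean opacity over the masked pixels (assumed form of o_i); 0.0 if the mask is empty."""
    total = 0.0
    for opacity_row, mask_row in zip(opacity_map, mask):
        for opacity, inside in zip(opacity_row, mask_row):
            if inside:
                total += opacity
    area = mask_area(mask)
    return total / area if area else 0.0


def constant_weights(masks: List[Mask]) -> List[float]:
    """Every mask gets the same weight, regardless of its size."""
    return [1.0 for _ in masks]


def area_proportional_weights(masks: List[Mask]) -> List[float]:
    """Each mask is weighted by its share of the total masked area (assumed normalization)."""
    areas = [mask_area(m) for m in masks]
    total = sum(areas)
    return [a / total if total else 0.0 for a in areas]


# Toy 2 x 3 example: the left four pixels belong to an object, the right two to a wall.
opacity = [[0.9, 0.8, 0.1],
           [1.0, 0.7, 0.0]]
object_mask = [[True, True, False],
               [True, True, False]]
wall_mask = [[False, False, True],
             [False, False, True]]

masks = [object_mask, wall_mask]
avg = [average_opacity(opacity, m) for m in masks]
```

In this toy case the object mask averages a high opacity (mostly covered by Gaussians) and the wall mask a low one (mostly left to the mesh), which matches the intuition in the Figure 3 caption that the average opacity indicates which primitive should represent each instance. With constant weights both masks count equally even though the wall mask is half the size, whereas area-proportional weights give the larger object mask twice the weight of the wall mask.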