MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Zhixuan Liu, Haokun Zhu, Rui Chen, Jonathan Francis, Soonmin Hwang, Ji Zhang, Jean Oh

TL;DR

A novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense and outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments.

Abstract

We introduce a diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a multi-channel inference-time optimization that avoids the error accumulation common in sequential or single-room constraints in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during the denoising process as more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Resources and code are at https://mosaic-cmubig.github.io
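
The variance claim can be made intuitive with a toy numerical sketch. This is not MOSAIC's optimizer or the paper's proof: it assumes each overlapping view contributes an independent Gaussian-noised estimate of the same scalar value and that the estimates are fused by plain averaging. The function name `fused_estimate_variance`, the noise level, and the true value are all hypothetical choices made for illustration.

```python
import random
import statistics


def fused_estimate_variance(num_views, noise_std=1.0, trials=20000, seed=0):
    """Empirical variance of the average of `num_views` noisy estimates.

    Assumption (not from the paper): every view observes the same underlying
    value corrupted by independent zero-mean Gaussian noise.
    """
    rng = random.Random(seed)
    true_value = 0.5  # hypothetical value shared by all overlapping views
    fused = []
    for _ in range(trials):
        estimates = [true_value + rng.gauss(0.0, noise_std) for _ in range(num_views)]
        fused.append(sum(estimates) / num_views)
    return statistics.pvariance(fused)


# Variance of the fused estimate for an increasing number of overlapping views.
variances = {n: fused_estimate_variance(n) for n in (1, 2, 4, 8)}
for n, v in variances.items():
    print(f"{n} overlapping views: variance ~ {v:.3f} (ideal {1.0 / n:.3f})")
```

Under these assumptions the variance of the fused estimate falls roughly as 1/N in the number of views N, which is the qualitative behavior the abstract describes. The actual guarantee in the paper concerns the denoising process and may rest on different conditions.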

Paper Structure

This paper contains 15 sections, 21 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: For privacy-preserving scenarios where RGB collection is restricted, MOSAIC generates consistent RGB images from depth data captured along robot paths, guided by text prompts. These outputs further enable 3D reconstruction of multi-room environments.
  • Figure 2: MOSAIC overview. (a) Multi-channel denoising. Each depth- and text-conditioned view is assigned its own latent channel. A shared denoiser iteratively refines the latent set while a multi-channel inference-time optimizer keeps the channels synchronized. (b) Projection loss. At every step, a depth-weighted projection loss $L_{\text{proj}}$, guided by the predicted clean latents $z_{0}$, drives the channels toward a geometry-consistent solution. (c) Pixel-space refinement. During the final denoising stages, the pixel-level loss $L_{\text{pixel}}$ fuses the views and enforces RGB consistency, yielding photorealistic, cross-view-aligned images that can be reconstructed into a coherent 3-D scene.
  • Figure 3: Qualitative comparison with multi-view baselines. For three indoor scenes (each conditioned on a style prompt) we show: input depth maps, ground-truth RGB, our MOSAIC result, and two baselines (MVDiffusion, Warp-and-Inpainting) that share the same input format. Below, we compare against further baselines (SceneTex, SceneCraft, Text2Room) using their native inputs. MOSAIC maintains photorealism, cross-view consistency, and prompt fidelity, whereas competing methods exhibit blur, style drift, or geometric artifacts.
  • Figure 4: Scene-level reconstruction. Fusing the multi-view images generated by MOSAIC with a standard TSDF pipeline produces coherent, textured meshes across diverse indoor rooms.
  • Figure 5: Qualitative ablation. Columns show matched views for two scenes; rows compare naïve averaging, MOSAIC without $L_{\text{pixel}}$, and full MOSAIC. Boxes mark the same objects across different viewpoints. Red: blur and ghosting artifacts; orange: texture drift; green: the full model corrects both.
  • ...and 1 more figure