Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow

Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, Angela Dai

Abstract

We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete, and hence synthetic, 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes as truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We use 3D layout boxes as an input conditioning signal, and our approach can be flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.
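To make the core idea concrete, the visibility-guided objective can be sketched as a standard (linear-path) flow matching loss in which only camera-observed regions contribute to the error. This is a minimal illustrative sketch, not the authors' implementation: the function names, the linear probability path, and the dense-array masking are all assumptions (the paper operates on sparse latent tokens from a masked VAE).

```python
import numpy as np

def masked_flow_matching_loss(x1, visibility_mask, model, rng):
    """Hypothetical sketch of a visibility-masked flow matching loss.

    x1:              clean data (e.g. TSDF patch latents), shape (N, D)
    visibility_mask: 1.0 where the region was observed by the camera,
                     0.0 for unknown regions (excluded from the loss)
    model:           callable (x_t, t) -> predicted velocity, shape (N, D)
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # linear interpolation path
    target = x1 - x0                          # constant-velocity target
    sq_err = (model(xt, t) - target) ** 2
    # Average the squared error over observed entries only, so unknown
    # regions of the partial scan never penalize the model.
    denom = np.maximum(visibility_mask.sum(), 1.0)
    return (sq_err * visibility_mask).sum() / denom
```

The key design point is that regions never seen by any camera are simply dropped from the objective, so the model is free to hallucinate plausible geometry there while still being supervised on everything that was actually scanned.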
Paper Structure

This paper contains 19 sections, 3 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Seen2Scene is the first visibility-guided flow matching approach for 3D scene completion and generation, trained directly on incomplete, real-world 3D scan data. Our approach predicts high-fidelity, geometrically complete, and structurally coherent scene geometry in unobserved regions, conditioned on observed areas and 3D box-based layouts.
  • Figure 2: Overview of Seen2Scene. We introduce visibility-guided flow matching for modeling the distribution of partial TSDF scans. (a) Partial scan TSDF patches $\mathbf{v}$ are encoded by a masked sparse VAE ($\mathcal{E}_{\tau}, \mathcal{D}_{\tau}$) into latent representations $\mathbf{z}$, masking out unknown regions unseen by the camera. (b) A sparse transformer $\mathcal{G}_\psi$ conditioned on 3D layout boxes $\mathcal{B}$ is trained with masked flow matching on surface and empty region tokens. (c) We fine-tune $\mathcal{G}_\psi$ for scan completion by injecting partial scan inputs $\mathbf{v}^{\mathrm{p}}$ via ControlNet [zhang2023adding]. (d) $\mathcal{G}_\psi$ can also be flexibly adapted for text- or layout-conditioned 3D scene generation from scratch.
  • Figure 3: Qualitative comparison on 3D scan completion. Given real-world partial depth scans from ScanNet++ [yeshwanthliu2023scannetpp] and ARKitScenes [arkitscenes], Seen2Scene learns realistic, high-fidelity priors from incomplete real-scan data to produce much more complete, detailed geometry than baselines.
  • Figure 4: 3D scene completion. Seen2Scene can complete large-scale scenes by generating geometry for unobserved regions across multiple chunks, guided by partial scan conditions injected into the ControlNet branch. Results are shown on ScanNet++ [yeshwanthliu2023scannetpp] and ARKitScenes [arkitscenes].
  • Figure 5: Qualitative comparison on layout-conditioned 3D scene generation. Our method produces more geometrically detailed and semantically coherent scenes compared to BlockFusion [wu2024blockfusion], LT3SD [meng2025lt3sd], and WorldGrow [worldgrow2025]. Layouts are from ScanNet++ [yeshwanthliu2023scannetpp] and ARKitScenes [arkitscenes].
  • ...and 10 more figures