
Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal

Rio Aguina-Kang, Kevin James Blackburn-Matzen, Thibault Groueix, Vladimir Kim, Matheus Gadelha

TL;DR

SeeingThroughClutter addresses the challenge of reconstructing structured 3D scenes from a single image in cluttered settings. It introduces a training-free two-stage pipeline. In the first stage, a vision-language model acts as an orchestrator that iteratively picks a foreground object, segments it, and inpaints the region it occupied. In the second stage, depth-guided layout optimization fuses the per-object meshes into a coherent scene. Key contributions include the VLM-driven object-removal framework, a depth-alignment refinement across the object-removed images using a coordinate-based MLP $f_{\theta_n}$, and an object-fitting pipeline based on a two-stage registration using $\mathrm{Sim}(3)$ transforms and depth cues, achieving state-of-the-art results on 3D-Front and ADE20K. The approach enables editable, occlusion-resilient 3D reconstructions without task-specific training, and it handles both indoor and outdoor clutter by relying on off-the-shelf foundation models.
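
For readers unfamiliar with the notation, a $\mathrm{Sim}(3)$ transform is a 3D similarity transform: a rotation, a uniform scale, and a translation. In standard notation (the symbols $s$, $R$, and $t$ are generic and are not taken from the paper), it acts on a point $x \in \mathbb{R}^3$ as

$$x' = s\,R\,x + t, \qquad s > 0,\ R \in SO(3),\ t \in \mathbb{R}^3,$$

so it has seven degrees of freedom: one for scale, three for rotation, and three for translation.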

Abstract

We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on 3D-Front and ADE20K datasets. Project Page: https://rioak.github.io/seeingthroughclutter/
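
The iterative removal loop described in the abstract can be sketched as follows. This is only a minimal illustration of the control flow, not the authors' implementation: the function `iterative_object_removal`, the `RemovalStep` record, and the three callables `propose_next`, `segment`, and `inpaint` are hypothetical names standing in for the VLM, the segmentation module, and the inpainting module, whose actual interfaces the text does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class RemovalStep:
    """One iteration of the loop (hypothetical record, for illustration)."""
    name: str          # object name proposed by the orchestrator
    mask: Any          # segmentation mask for that object
    image_before: Any  # image in which the object was still visible


def iterative_object_removal(
    image: Any,
    propose_next: Callable[[Any], Optional[str]],  # stand-in for the VLM
    segment: Callable[[Any, str], Any],            # stand-in for the segmenter
    inpaint: Callable[[Any, Any], Any],            # stand-in for the inpainter
    max_steps: int = 20,                           # assumed safety cap
) -> Tuple[List[RemovalStep], Any]:
    """Remove foreground objects one at a time.

    Returns the per-object records (in removal order) and the final
    image with all proposed objects removed.
    """
    steps: List[RemovalStep] = []
    current = image
    for _ in range(max_steps):
        name = propose_next(current)
        if name is None:  # orchestrator reports nothing left to remove
            break
        mask = segment(current, name)
        steps.append(RemovalStep(name=name, mask=mask, image_before=current))
        current = inpaint(current, mask)  # fill in the background
    return steps, current
```

Each record keeps the image in which the object was last visible together with its mask, which mirrors the statement that per-object meshes are later reconstructed from the sequence of images and their corresponding masks.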

Paper Structure

This paper contains 14 sections, 2 equations, 8 figures, 2 tables, 3 algorithms.

Figures (8)

  • Figure 1: Single-view scene reconstruction is often challenging due to the complexity and clutter of real-world environments. Consider a scenario where the goal is to reconstruct a table within a scene. In panel A, we localize and segment the table using a bounding box; however, the resulting mask is noisy and unclear, largely because various items on the tabletop, and other objects like chairs, fall inside the box. In contrast, panel B shows the same scene after these extraneous objects have been removed from the visible region, yielding a much cleaner and more accurate segmentation and reconstruction.
  • Figure 2: Overview. Our method consists of two stages. In the first, iterative object-removal stage, we use custom VLM prompting to identify the next best candidate for removal. Based on the predicted object name, a segmentation module computes its mask, and an inpainting module fills in the background. In the second, layout-optimization stage, we reconstruct a complete mesh for each removed object using the sequence of images and their corresponding masks. To place the objects into a shared scene, we first apply monocular depth estimation, then perform depth-map refinement to align all independently predicted depth maps. Finally, we translate and scale each object to fit the unified point cloud, producing the final 3D scene.
  • Figure 3: Depth refinement. When each object-removed image is passed independently through a standard depth estimator, the resulting depth maps do not align. To reconcile these discrepancies, we introduce a depth-refinement optimization that jointly adjusts the estimated depths into a single, coherent representation. Left: point clouds reconstructed from the raw depth estimates, color-coded by their source image. Right: the same scene after applying our alignment procedure. Notice how the optimization brings all point clouds into tight agreement.
  • Figure 4: Qualitative comparison on ADE20K segmentations. From left to right: input, ground truth, with iterative object removal, and without iterative object removal.
  • Figure 5: Text-to-3D scene pipeline. We show that our method can be applied to images generated by text-to-image models such as SDXL. This yields a fully automated text-to-3D scene pipeline.
  • ...and 3 more figures
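
The depth-alignment and object-placement steps mentioned in the Figure 2 and Figure 3 captions can be illustrated with a deliberately simpler stand-in. The sketch below does not reproduce the paper's coordinate-based MLP or its two-stage registration. Instead, it assumes that a single scale and shift per depth map is enough to align it to a reference on pixels visible in both images, and that an object can be placed by matching the centroid and spread of its points to a target point cloud. The function names `align_depth_affine` and `fit_scale_translation` are hypothetical.

```python
import numpy as np


def align_depth_affine(depth, reference, mask):
    """Fit scale and shift so that scale * depth + shift ~= reference.

    `mask` is a boolean array selecting pixels that show the same
    surface in both depth maps (an assumption of this sketch).
    Returns the aligned depth map, the scale, and the shift.
    """
    d = depth[mask].ravel()
    r = reference[mask].ravel()
    # Least-squares solve of [d, 1] @ [scale, shift] = r.
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, r, rcond=None)
    return scale * depth + shift, float(scale), float(shift)


def fit_scale_translation(source_pts, target_pts):
    """Uniform scale and translation matching centroid and RMS spread.

    Both inputs are (N, 3) and (M, 3) arrays; no point correspondences
    are needed. Rotation is not estimated in this simplified sketch.
    Apply the result as: scale * source_pts + translation.
    """
    cs = source_pts.mean(axis=0)
    ct = target_pts.mean(axis=0)
    rms_s = np.sqrt(((source_pts - cs) ** 2).sum(axis=1).mean())
    rms_t = np.sqrt(((target_pts - ct) ** 2).sum(axis=1).mean())
    scale = rms_t / rms_s
    translation = ct - scale * cs
    return float(scale), translation
```

A global scale-and-shift fit cannot correct spatially varying disagreement between depth maps, which is presumably one reason the paper uses a more expressive, coordinate-based refinement instead.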