CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Wei Yang, Lan Xu, Jiayuan Gu, Jingyi Yu

TL;DR

CAST tackles single-image 3D scene reconstruction by decomposing a scene into object-centric generation and inter-object constraints. It introduces an occlusion-aware ObjectGen and a diffusion-based AlignGen to generate per-object geometry in a canonical frame and align it with the scene, then iteratively refines placement. A physics-aware correction stage, driven by a GPT-4V-derived scene relation graph and SDF-based optimization, enforces realistic contact and support relations between objects. The framework handles open-vocabulary scenes with high geometric fidelity, realistic textures, and physically coherent interactions, enabling real-to-simulation pipelines for robotics and editable 3D environments. Empirical results show quantitative and qualitative gains over retrieval-based and prior generation-based methods, supported by ablations and user studies.

Abstract

Recovering high-quality 3D scenes from a single RGB image is a challenging task in computer graphics. Current methods often struggle with domain-specific limitations or low-quality object generation. To address these issues, we propose CAST (Component-Aligned 3D Scene Reconstruction from a Single RGB Image), a novel method for 3D scene reconstruction and recovery. CAST starts by extracting object-level 2D segmentation and relative depth information from the input image, then uses a GPT-based model to analyze inter-object spatial relationships. This captures how objects relate to each other within the scene and leads to a more coherent reconstruction. CAST then employs an occlusion-aware large-scale 3D generation model to independently generate each object's full geometry, using MAE and point cloud conditioning to mitigate the effects of occlusions and partial object information, so that the result aligns accurately with the source image's geometry and texture. To align each object with the scene, the alignment generation model computes the necessary transformations, allowing the generated meshes to be accurately placed and integrated into the scene's point cloud. Finally, CAST incorporates a physics-aware correction step that maps a fine-grained relation graph to a constraint graph. This constraint graph guides the optimization of object poses, ensuring physical consistency and spatial coherence. By utilizing Signed Distance Fields (SDFs), the model effectively addresses issues such as occlusions, object penetration, and floating objects, so that the generated scene accurately reflects real-world physical interactions. CAST can be leveraged in robotics, enabling efficient real-to-simulation workflows and providing realistic, scalable simulation environments for robotic systems.
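
The physics-aware correction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes sphere-shaped stand-in objects whose SDF is known in closed form, a single "support" relation between two objects, and a one-step shift along the vertical axis in place of the paper's pose optimization. The names `Sphere`, `min_signed_distance`, `classify`, and `resolve_along_axis` are hypothetical and chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Sphere:
    """Stand-in object (assumption): a sphere with a closed-form SDF."""
    center: tuple
    radius: float

    def sdf(self, p):
        # Negative inside, zero on the surface, positive outside.
        return math.dist(p, self.center) - self.radius

    def surface_samples(self, n=200):
        # Roughly uniform surface points via a Fibonacci spiral.
        golden = math.pi * (3.0 - math.sqrt(5.0))
        cx, cy, cz = self.center
        pts = []
        for i in range(n):
            z = 1.0 - 2.0 * (i + 0.5) / n
            r = math.sqrt(1.0 - z * z)
            t = golden * i
            pts.append((cx + self.radius * r * math.cos(t),
                        cy + self.radius * r * math.sin(t),
                        cz + self.radius * z))
        return pts


def min_signed_distance(a, b):
    """Smallest value of b's SDF over surface samples of a."""
    return min(b.sdf(p) for p in a.surface_samples())


def classify(gap, relation, tol=0.05):
    """Label one edge of the constraint graph from its signed gap."""
    if gap < -tol:
        return "penetration"  # surfaces overlap
    if relation == "support" and gap > tol:
        return "floating"     # supported object does not touch its support
    return "ok"


def resolve_along_axis(a, b, axis=(0.0, 0.0, 1.0)):
    """One-step correction (assumption): shift a along axis by -gap.

    Only valid when a sits above b along the axis; a real system would
    optimize the full pose against SDF-based losses instead.
    """
    gap = min_signed_distance(a, b)
    moved = tuple(c - gap * d for c, d in zip(a.center, axis))
    return Sphere(moved, a.radius)


if __name__ == "__main__":
    base = Sphere((0.0, 0.0, 0.0), 1.0)
    top = Sphere((0.0, 0.0, 2.5), 1.0)  # hovers above its support
    print(classify(min_signed_distance(top, base), "support"))
    fixed = resolve_along_axis(top, base)
    print(classify(min_signed_distance(fixed, base), "support"))
```

The sign of the sampled SDF value separates the two failure modes named in the abstract: a negative gap indicates penetration, while a positive gap on a "support" edge indicates a floating object. The tolerance `tol` absorbs the sampling error of the discrete surface points.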

Paper Structure

This paper contains 37 sections, 11 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of the proposed pipeline. The input RGB image is processed through scene analysis to extract key information, followed by pose-aware generation to create initial 3D models. Physical constraint refinement ensures realistic interactions and spatial relations, yielding a high-quality, mesh-based 3D scene.
  • Figure 2: Network design of our alignment generation model (Sec. \ref{sec:transformationgen}), occlusion-aware object generation model (Sec. \ref{sec:objectgen}), and an illustrative figure of the texture generation model.
  • Figure 3: Physics-aware correction via constraint graph mapped from fine-grained relation graph. Top: Floating surfboard grounded on the van. Bottom: Penetrating guitar and cooler separated.
  • Figure 4: Bringing the vibrant diversity of the real world into the virtual realm, this collection reimagines open-vocabulary scenes as immersive digital environments, capturing the richness and depth of each unique setting. For each scene, the images display as follows: the top-left shows the input image, the top-center displays the rendered geometry, and the right presents the rendered image with realistic textures.
  • Figure 5: Qualitative comparisons of CAST with state-of-the-art single-image scene reconstruction methods. From left to right: input image, CAST, ACDC, and Gen3DSR. Top to bottom: random open-vocabulary images (rows 1–3), Gen3DSR inputs (rows 4–5), ACDC inputs (rows 6–7).
  • ...and 6 more figures