CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration

Boshi Tang, Henry Zheng, Rui Huang, Gao Huang

TL;DR

CC-FMO presents a fully foundation-model-driven approach for zero-shot single-image to 3D scene generation, introducing a hybrid VecSet+SLAT object generator to preserve semantics while achieving high geometric fidelity. It pairs this with a camera-conditioned, scale-aware pose-estimation module that employs a closed-form solution and normal-map texturing to leverage FoundationPose effectively. The pipeline operates entirely in a zero-shot regime, demonstrating superior scene- and object-level fidelity and strong generalization across camera intrinsics compared to training-dependent baselines. The work highlights the importance of tightly integrating preprocessing, semantic-aware object generation, and metric-scale pose estimation to enable metrically calibrated, camera-aligned 3D scenes from a single RGB image. Overall, CC-FMO offers a practical, training-free path toward robust compositional scene generation with foundation-model orchestration, with potential impact on AR/VR and embodied AI applications.

Abstract

High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. Early approaches struggle to generalize because they rely on specialized models trained on small curated datasets. While recent advances in large-scale 3D foundation models have significantly improved instance-level generation, coherent scene generation remains a challenge: performance is limited by inaccurate per-object pose estimates and spatial inconsistency. To this end, this paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation that both conforms to the object layout in the input image and preserves instance fidelity. CC-FMO employs a hybrid instance generator that combines a semantics-aware vector-set representation with a detail-rich structured latent representation, yielding object geometries that are both semantically plausible and high-quality. Furthermore, CC-FMO enables the application of foundation pose estimation models to the scene generation task via a simple yet effective camera-conditioned scale-solving algorithm that enforces scene-level coherence. Extensive experiments demonstrate that CC-FMO consistently generates high-fidelity, camera-aligned compositional scenes, outperforming all state-of-the-art methods.

Paper Structure

This paper contains 31 sections, 3 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: CC-FMO generates high-fidelity 3D scenes from single images in a zero-shot manner by addressing key challenges in applying 3D foundation models for scene generation. It combines (1) a hybrid instance generator to preserve instance semantics and geometric fidelity and (2) a foundation pose estimation model augmented with a novel metric-scale solving algorithm.
  • Figure 2: Illustration of CC-FMO, a zero-shot framework built entirely on foundation models for single-image to 3D scene generation. Our method delivers semantically accurate, high-fidelity geometry alongside precise per-object pose estimation. We first extract clean object instances through occlusion-aware segmentation and inpainting, providing reliable inputs for 3D generation. Next, we apply a hybrid 3D generation approach that combines vector latent-set (VecSet) and structured-latent (SLAT) foundation models to synthesize high-quality, semantically consistent instances. Finally, to ensure coherent scene composition, we pair a scale-solving algorithm with a foundation pose estimation model to resolve scale ambiguity and align the generated objects with the spatial layout of the input image in a camera-conditioned manner.
  • Figure 3: Qualitative results. CC-FMO demonstrates strong performance in this challenging scenario. Compared to baseline methods, our model produces meshes with superior geometric fidelity while maintaining object layouts that closely align with the input images. Refer to our supplementary materials for more qualitative comparisons.
  • Figure 4: Ablation visualization. Removing the structured latent component (SLAT) leads to degraded geometric quality, while omitting the vector set (VecSet) results in reduced semantic accuracy. Our instance generator effectively preserves semantic accuracy and simultaneously produces detailed geometry.
  • Figure 5: Qualitative comparison between CC-FMO and MIDI. Note that the segmentation masks are the same.