
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

Ze-Xin Yin, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie

Abstract

Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation and per-instance generation. The former directly predicts 3D assets with explicit 6-DoF poses through efficient network inference but generalizes poorly to complex scenes; the latter improves generalization through a divide-and-conquer strategy but suffers from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets, conditioned on the partially visible point clouds at their original locations, which are cropped from the fragmented geometry produced by off-the-shelf geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses the fragmented geometry as a spatial anchor to preserve layout fidelity. At its core is a coarse-to-fine generation scheme that resolves boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data-scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, significantly outperforming baselines such as MIDI and Gen3DSR while maintaining the efficiency of diffusion-based generation. Code and data will be publicly available at https://zx-yin.github.io/3dfixer.

Paper Structure

This paper contains 22 sections, 1 equation, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Performance overview. 3D-Fixer extends pre-trained image-to-3D generative priors to achieve compositional 3D scene generation through a novel in-place completion paradigm. (a) Our method significantly outperforms baselines such as Gen3DSR and MIDI in geometry quality. (b) It further demonstrates strong generalization to complex real-world and outdoor scenes.
  • Figure 2: Architecture of the 3D-Fixer pipeline and dataset. (Top) Scene Decomposition extracts instance-level partial geometry from the input. (Bottom-left) Progressive Completion generates the full asset via three stages: 1) The Coarse Structure Completer hallucinates topology within a loose bound; 2) The Fine Shape Refiner sharpens geometry within a fine boundary; and 3) The Occlusion-Aware 3D Texturer applies observation-aligned textures. (Bottom-right) Our ARSG-110K Dataset provides high-quality assets and rich scene compositions for training.
  • Figure 3: 3D-Fixer extends the diffusion transformer from prior xiang2025structured (orange) to a dual-stream architecture (blue), where a trainable branch encoding scene-specific geometric cues interacts with a frozen generative branch to implement ORFA and enforce structural constraints.
  • Figure 4: Visualization of the results on the Gen3DSR test set and our test set. The results on the Gen3DSR test set demonstrate the robustness of our scheme across different scenes, while the results on our test set show its great potential in handling complex scenes.
  • Figure 5: Visualization of ablation studies. Experiments (a)-(d) evaluate the coarse-to-fine (C2F) strategy and the number of network layers (K), as follows: (a) w/o C2F, K=12; (b) w/ C2F, K=6; (c) w/ C2F, K=12; (d) w/ C2F, K=18. Experiments (e)-(h) evaluate the Alignment Loss (AL), depth ratio embedding (Dpt.), and the global feature input (Glob.), as follows: (e) w/ C2F, K=12, AL, Dpt., and Glob.; (f) w/o AL and Dpt.; (g) w/o AL and Glob.; (h) w/o AL, Dpt., and Glob.
  • ...and 5 more figures
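The in-place completion paradigm summarized above (crop an instance's partial points from the fragmented scene geometry, then complete it coarse-to-fine inside progressively tighter bounds, without any pose optimization) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`crop_instance`, `complete_in_place`), the padding ratios, and the `completer` callback standing in for the diffusion-based generator are all hypothetical.

```python
import numpy as np

def crop_instance(scene_points, instance_ids, target_id):
    """Extract one instance's partially visible point cloud from the
    fragmented scene geometry, keeping its original world coordinates."""
    return scene_points[instance_ids == target_id]

def loose_bound(points, pad=0.25):
    """Coarse stage: a generously padded axis-aligned box within which
    occluded structure may be hallucinated (pad ratio is illustrative)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    margin = pad * (hi - lo).max()
    return lo - margin, hi + margin

def fine_bound(points, pad=0.05):
    """Fine stage: a tighter box used to sharpen the coarse geometry."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    margin = pad * (hi - lo).max()
    return lo - margin, hi + margin

def complete_in_place(scene_points, instance_ids, completer):
    """Coarse-to-fine completion per instance. The partial cloud serves
    as the spatial anchor, so the completed asset stays at its original
    location and no 6-DoF pose optimization is required."""
    assets = {}
    for iid in np.unique(instance_ids):
        partial = crop_instance(scene_points, instance_ids, iid)
        coarse = completer(partial, loose_bound(partial))   # stage 1: topology
        refined = completer(coarse, fine_bound(partial))    # stage 2: sharpening
        assets[iid] = refined                               # in place, no alignment
    return assets
```

The key design point mirrored here is that both bounds are derived from the partial observation itself, so every generated asset remains anchored to the layout recovered by the geometry estimator.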