AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views

Junwei Zhou, Yu-Wing Tai

TL;DR

AmodalGen3D addresses amodal 3D reconstruction from sparse, unposed views by integrating 2D amodal priors with partial 3D geometry through a dual-attention framework. The View-Wise Cross Attention aggregates multi-view completions while the Stereo-Conditioned Cross Attention leverages partial MVS geometry with a geometry-guided gating mechanism to infer unseen structure. The approach is trained with a synthetic object-centric data engine using conditional flow matching and validated across synthetic and real datasets, showing improved fidelity, completeness, and cross-view consistency over baselines. This work enables robust object-level 3D reconstruction under occlusion-heavy and sparse-view conditions, with broad implications for robotics, AR/VR, and embodied AI. Overall, AmodalGen3D demonstrates that combining strong 2D priors with geometry-aware 3D generation yields coherent, occlusion-free 3D objects even when large regions remain unobserved.
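The conditional flow matching objective mentioned above can be illustrated with a minimal sketch. It assumes the commonly used linear interpolation path between a noise sample and a data sample; the source does not state which path or parameterization AmodalGen3D uses. The names `cfm_loss`, `velocity_model`, and `oracle_velocity` are hypothetical, and the conditioning argument `cond` stands in for whatever image and geometry features the real model receives.

```python
import random


def cfm_loss(velocity_model, x1, cond, rng):
    """One-sample flow matching loss on a linear path (assumed form, not the paper's code)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise sample
    t = rng.random()                                       # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    target = [b - a for a, b in zip(x0, x1)]               # velocity of the linear path
    pred = velocity_model(xt, t, cond)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


def oracle_velocity(x1):
    """Hypothetical model that already knows x1, used only to sanity-check the loss."""
    def model(xt, t, cond):
        # On the linear path, (x1 - xt) / (1 - t) equals x1 - x0.
        return [(b - x) / (1.0 - t) for x, b in zip(xt, x1)]
    return model


if __name__ == "__main__":
    x1 = [0.5, -1.0, 2.0]
    print(cfm_loss(oracle_velocity(x1), x1, None, random.Random(0)))
```

In practice the velocity model would be a network trained by averaging this loss over many samples; the oracle here only shows that the regression target is the path velocity.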

Abstract

Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.
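To make the idea of a gated, geometry-conditioned cross attention more concrete, here is a rough single-head sketch. It is an illustration under stated assumptions, not the authors' module: the source only says that the Stereo-Conditioned Cross Attention uses a geometry-guided gating mechanism, so the per-query scalar sigmoid gate, the residual form, and the function name `gated_cross_attention` are all hypothetical. Learned projections are omitted, and values are assumed to share the query dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(queries, keys, values, gate_logits):
    """Cross attention whose residual update is scaled by a per-query sigmoid gate.

    queries: latent tokens; keys/values: conditioning tokens (for example,
    features of partial stereo points); gate_logits: one scalar per query.
    """
    d = len(queries[0])
    out = []
    for q, g in zip(queries, gate_logits):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]
        gate = 1.0 / (1.0 + math.exp(-g))  # value in (0, 1)
        out.append([qi + gate * ai for qi, ai in zip(q, attended)])
    return out
```

With a strongly negative gate logit the query passes through almost unchanged, which is one plausible way to let a model ignore stereo geometry where it is unreliable or missing.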

Paper Structure

This paper contains 19 sections, 7 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Comparison of 3D object reconstruction from sparse, unposed, and occluded inputs. From left to right: input sparse views, results from (a) MVS-based reconstruction (e.g., VGGT wang2025vggt), (b) 2D amodal completion lifted to 3D (TRELLIS xiang2025structured), and (c) AmodalGen3D (ours). Our generative model faithfully reconstructs complete, occlusion-free 3D geometry consistent with sparse-view constraints.
  • Figure 2: Overview of AmodalGen3D. Given sparse images $S$, visibility masks $M_{vis}$, and occlusion masks $M_{occ}$ indicating the occluded object $O$, AmodalGen3D first generates a sparse structure (stage 1) by aggregating multi-view information (Sec. \ref{method:viewwise}) and inferring the complete geometric structure from the partial stereo point cloud $\mathbf{P}_{O}$ (Sec. \ref{method:scca}). Once the sparse structure is obtained, a pretrained amodal SLAT Transformer generates the structured latent, with texture generation controlled by the visibility masks $M_{vis}$ and occlusion masks $M_{occ}$. This structured latent is decoded into an occlusion-free 3D object with high-quality geometry and appearance.
  • Figure 3: The TRELLIS model shows a strong bias and can result in artifact accumulation when conditioned with multiple views.
  • Figure 4: A detailed illustration of our proposed View-Wise Cross Attention and Stereo-Conditioned Cross Attention.
  • Figure 5: Amodal 3D object reconstruction results on the GSO dataset. We show examples generated from 1, 2, and 4 views, respectively. Note that for FreeSplatter xu2025freesplatter, we use Hunyuan-MV yang2024tencent to generate multiple views from a single input image (first two rows).
  • ...and 9 more figures