Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham

TL;DR

Amodal3R tackles occlusion in 3D reconstruction by directly operating in a 3D latent space and conditioning on visibility and occlusion priors, avoiding a separate 2D amodal completion step. It extends the TRELLIS framework with mask-weighted cross-attention and an occlusion-aware layer to jointly reconstruct geometry and texture from partially visible inputs. Trained on synthetic data, Amodal3R achieves state-of-the-art performance on multiple benchmarks and demonstrates strong generalization to real-world scenes and in-the-wild images. The work establishes a new benchmark for occlusion-aware 3D reconstruction and highlights the benefit of end-to-end 3D-centric amodal reasoning in challenging real-world scenarios.

Abstract

Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.

Paper Structure

This paper contains 43 sections, 4 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Example results of Amodal3R. Given partially visible objects within images (occluded regions are shown in black, visible areas in red outlines), our Amodal3R generates diverse semantically meaningful 3D assets with reasonable geometry and plausible appearance. We sample multiple times to get different results from the same occluded input. Trained on synthetic datasets, it generalizes well to real-scene and in-the-wild images, where most objects are partially visible, and reconstructs reasonable 3D assets.
  • Figure 2: Overview of Amodal3R. Given an image as input and the regions of interest, Amodal3R first extracts the partially visible target object, along with the visibility and occlusion masks, using an off-the-shelf 2D segmenter. It then applies DINOv2 (Oquab et al.) to extract features $\bm{c}_{dino}$ as additional conditioning for the 3D reconstructor. To enhance occlusion reasoning, each transformer block incorporates a mask-weighted cross-attention (via $\bm{c}_{vis}$) and an occlusion-aware attention layer (via $\bm{c}_{occ}$), ensuring the 3D reconstructor accurately perceives visible information while effectively inferring occluded parts. For conditioning details, see the section on mask-weighted cross-attention.
  • Figure 3: The transformer structure of Amodal3R. Compared with the original TRELLIS design (Xiang et al., 2024), we further introduce the mask-weighted cross-attention and the occlusion-aware layer. These additions apply to both the sparse-structure and SLAT diffusion models.
  • Figure 4: 3D-consistent mask example. Given a 3D mesh, we render selected triangles in a distinct color from the others to generate multi-view consistent masks. It allows the evaluation of multi-view methods in handling contact occlusion. (The occluded regions are shown in red.)
  • Figure 5: Single-view amodal 3D reconstruction. The occlusion regions are shown in black and the visible regions are highlighted with red outlines. More examples are provided in supplementary material Fig. C.4.
  • ...and 12 more figures
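
The mask-weighted cross-attention named in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' formulation, which is not given in this summary. It shows one plausible reading, assumed here for illustration only: each image patch carries a visibility weight in [0, 1] (standing in for $\bm{c}_{vis}$), and the logarithm of that weight is added to the attention logits so that occluded patches contribute little to the output. The function name, the argument names, and the `eps` constant are all hypothetical.

```python
import numpy as np

def mask_weighted_cross_attention(queries, keys, values, visibility, eps=1e-6):
    """Single-head cross-attention biased by a per-patch visibility weight.

    Illustrative sketch only; the weighting rule is an assumption.
    queries:    (n_q, d) latent tokens of the 3D reconstructor
    keys:       (n_k, d) image patch features (e.g. from DINOv2)
    values:     (n_k, d_v) image patch features
    visibility: (n_k,) fraction of each patch that is visible, in [0, 1]
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)            # (n_q, n_k)
    # Assumed weighting: adding log(visibility) to the logits is equivalent
    # to multiplying the unnormalised attention weights by visibility.
    logits = logits + np.log(visibility + eps)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Toy usage: four patches, the last two fully occluded.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
visibility = np.array([1.0, 1.0, 0.0, 0.0])
out, weights = mask_weighted_cross_attention(queries, keys, values, visibility)
print(weights.round(3))
```

Under this assumed rule, fully occluded patches receive near-zero attention, so the visible evidence dominates the conditioning. Inferring what lies behind the occluder is left to the separate occlusion-aware layer (via $\bm{c}_{occ}$), which this sketch does not model.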