Amodal Depth Anything: Amodal Depth Estimation in the Wild

Zhenyu Li, Mykola Lavreniuk, Jian Shi, Shariq Farooq Bhat, Peter Wonka

TL;DR

This work tackles amodal depth estimation in real-world images by reframing the problem as relative depth prediction to improve cross-domain generalization. It introduces the Amodal Depth In the Wild (ADIW) dataset, generated via segmentation-guided compositing and scale-and-shift depth alignment, which enables scalable real-world supervision. Two architectures, Amodal-DAV2 (deterministic) and Amodal-DepthFM (generative), adapt large pre-trained depth models with minimal modification, adding guidance channels to predict depth structures in occluded regions. Both perform strongly, and Amodal-DAV2-L sets a new state of the art (SoTA) on ADIW. The results highlight the value of object-level supervision, guidance signals, and alignment techniques for coherent amodal depth estimates, with practical implications for understanding occluded geometry and for downstream tasks such as 3D reconstruction and inpainting.

Abstract

Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions. Experiments validate our design choices, demonstrating the flexibility of our models in generating diverse, plausible depth structures for occluded regions. Our method achieves a 69.5% improvement in accuracy over the previous SoTA on the ADIW dataset.
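The scale-and-shift alignment mentioned in the abstract can be illustrated with a closed-form least-squares fit. The sketch below is not the authors' implementation. It assumes depth maps are flattened to plain lists, that the fit uses only pixels not covered by the pasted occluder, and that blending keeps the composite depth on visible pixels and fills occluded pixels with the aligned background depth. The names `fit_scale_shift` and `blend_amodal_depth` are hypothetical.

```python
def fit_scale_shift(source, reference, mask):
    """Return (scale, shift) minimizing sum((scale*source + shift - reference)^2)
    over the pixels where mask is True (ordinary least squares in closed form)."""
    xs = [x for x, m in zip(source, mask) if m]
    ys = [y for y, m in zip(reference, mask) if m]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two masked pixels to fit scale and shift")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("source depth is constant over the mask; scale is undefined")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def blend_amodal_depth(background_depth, composite_depth, occluder_mask):
    """Assumed blending rule: align the background prediction to the composite
    prediction on pixels outside the occluder, then use the aligned background
    depth where the occluder hides the scene and the composite depth elsewhere."""
    visible = [not m for m in occluder_mask]
    scale, shift = fit_scale_shift(background_depth, composite_depth, visible)
    return [
        scale * bg + shift if occluded else comp
        for bg, comp, occluded in zip(background_depth, composite_depth, occluder_mask)
    ]


if __name__ == "__main__":
    # Toy 4-pixel example: the third pixel is covered by the pasted occluder.
    background = [1.0, 2.0, 3.0, 4.0]
    composite = [2.5, 4.5, 0.1, 8.5]
    occluder = [False, False, True, False]
    print(blend_amodal_depth(background, composite, occluder))
```

Fitting only on pixels outside the occluder keeps the occluder's own depth from biasing the estimate, which is why the mask is passed to the fit rather than applied afterwards.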

Paper Structure

This paper contains 22 sections, 4 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Amodal Depth Estimation in the Wild. For each image, we present the general depth estimation result alongside our amodal depth estimation, with the target object outlined in black. Our model generalizes well across diverse scenes and accurately estimates depth for the occluded parts of objects. Best viewed in color.
  • Figure 2: Amodal Depth Estimation Pipeline. Given an input image, users can generate the amodal mask for the depth estimator in two ways: (1) Model Heuristics: click the target object, apply SAM [kirillov2023sam] to generate the modal mask, then use an amodal segmentation method to estimate the amodal mask; (2) Human Heuristics: manually draw the amodal mask. Our model estimates amodal depth from the original observed image $I_o$, the observed depth map $D_o$, and the amodal mask $M_a$.
  • Figure 3: Constructing Training Data. We use the method from [ozguroglu2024pix2gestalt] to convert an initial segmentation dataset into a whole-object dataset. Next, we sample and composite images to create training pairs. Because of the occluders, the relative depth predictions differ between the composite and background images, so we apply scale-and-shift alignment for consistent depth blending.
  • Figure 4: Amodal-DAV2 Framework Structure. Amodal-DAV2 modifies the DAV2 image encoder to take additional guidance channels along with RGB.
  • Figure 5: Amodal-DepthFM Framework Structure. Amodal-DepthFM modifies the DepthFM denoising UNet encoder to take additional guidance channels along with the RGB latent code.
  • ...and 7 more figures
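
The captions for Figures 4 and 5 describe widening a pre-trained encoder's input so it accepts guidance channels in addition to RGB. One common way to do this without disturbing the pre-trained behaviour is to append zero-initialized input weights. The sketch below illustrates that idea on a single per-pixel linear projection. It is an assumption, not the authors' stated initialization, and treating the guidance channels as the observed depth $D_o$ and the amodal mask $M_a$ is an inference from the Figure 2 caption. The names `widen_projection` and `project` are hypothetical.

```python
def widen_projection(weight, extra_in):
    """Append `extra_in` zero-initialized input columns to a projection matrix.

    `weight` is a list of rows, one per output feature, each holding the
    pre-trained coefficients for the original input channels (here RGB)."""
    return [list(row) + [0.0] * extra_in for row in weight]


def project(weight, features):
    """Apply the linear projection to one pixel's feature vector."""
    for row in weight:
        if len(row) != len(features):
            raise ValueError("input width does not match the projection")
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


if __name__ == "__main__":
    # Hypothetical pre-trained weights mapping 3 RGB inputs to 2 features.
    rgb_weight = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
    widened = widen_projection(rgb_weight, extra_in=2)

    rgb = [0.6, 0.1, 0.9]
    guidance = [0.35, 1.0]  # assumed: observed depth value, amodal mask value

    # At initialization the widened layer reproduces the RGB-only output,
    # so fine-tuning starts from the pre-trained model's behaviour.
    print(project(rgb_weight, rgb))
    print(project(widened, rgb + guidance))
```

With this initialization the new channels contribute nothing until training updates their weights, so the model can learn to use the guidance gradually instead of starting from a perturbed encoder.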