No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation

Mingyu Sung, Hyeonmin Choe, Il-Min Kim, Sangseok Yun, Jae Mo Kang

TL;DR

This work tackles domain shift in monocular depth estimation (MDE) by introducing PITTA, a pose-agnostic test-time adaptation framework that uses instance-aware masking of dynamic objects and edge cues to refine depth predictions without camera pose information. It combines a pretrained MDE network with a frozen panoptic segmentation model, uses instance-wise masks and Laplacian-derived edges, and optimizes a dual loss $L = L_d + \lambda L_e$ while updating only a selected subset of parameters $\theta$. Experiments on DrivingStereo and Waymo report improvements over prior state-of-the-art methods across multiple MDE metrics and backbones, indicating both robustness to diverse conditions and transferability across architectures. The approach removes the reliance on pose estimation, which is a practical gain for deploying MDE in dynamic real-world environments.
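To make the dual loss $L = L_d + \lambda L_e$ concrete, here is a minimal NumPy sketch of how Laplacian-derived edges and the weighted sum could fit together. It is not the authors' implementation. The 4-neighbour kernel, the max-normalization, the mean-absolute-difference form of `edge_loss`, and the default `lam=0.1` are all assumptions made for illustration; the summary does not define $L_d$ or $L_e$, so $L_d$ is simply passed in as a number.

```python
import numpy as np

# 4-neighbour discrete Laplacian (an assumed kernel; the paper's exact choice is not given here).
LAPLACIAN_KERNEL = np.array([[0.0, 1.0, 0.0],
                             [1.0, -4.0, 1.0],
                             [0.0, 1.0, 0.0]])


def laplacian_edges(x: np.ndarray) -> np.ndarray:
    """Return |Laplacian(x)| for a 2D array, using edge-replicated padding."""
    h, w = x.shape
    padded = np.pad(x.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    # Accumulate the 3x3 filter response as a sum of shifted, weighted views.
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN_KERNEL[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.abs(out)


def normalize(edges: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale an edge map by its maximum so image and depth edges are comparable."""
    return edges / (edges.max() + eps)


def edge_loss(gray_image: np.ndarray, depth: np.ndarray) -> float:
    """Hypothetical L_e: mean absolute difference between normalized edge maps."""
    e_img = normalize(laplacian_edges(gray_image))
    e_dep = normalize(laplacian_edges(depth))
    return float(np.mean(np.abs(e_img - e_dep)))


def total_loss(l_d: float, l_e: float, lam: float = 0.1) -> float:
    """Dual loss L = L_d + lambda * L_e; lam=0.1 is an illustrative default."""
    return l_d + lam * l_e
```

In an actual adaptation loop these operations would live in an autodiff framework so that gradients reach the selected parameters $\theta$; the NumPy version only shows the arithmetic.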

Abstract

Monocular depth estimation (MDE), which infers pixel-level depth from a single RGB image captured by a monocular camera, plays a crucial role in AI applications that require a three-dimensional (3D) understanding of the scene. In real-world scenarios, MDE models often need to be deployed in environments whose conditions differ from those seen during training. Test-time (domain) adaptation (TTA) is a compelling and practical approach to this problem. Although there have been notable advances in TTA for MDE, particularly in a self-supervised manner, existing methods remain ineffective when applied to diverse and dynamic environments. To address this challenge, we propose a novel TTA framework for MDE, named PITTA. Our approach incorporates two key strategies: (i) a pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA performs TTA on a pretrained MDE network in a pose-agnostic manner, without resorting to any camera pose information. In addition, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles and pedestrians) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction method for the input image (i.e., a single monocular image) and the depth map. Extensive experimental evaluations on the DrivingStereo and Waymo datasets under varying environmental conditions demonstrate that PITTA surpasses existing state-of-the-art techniques, with substantial performance improvements in MDE during TTA.
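The instance-aware masking step described in the abstract can be illustrated with a small NumPy sketch. It assumes the panoptic segmentation output is available as two per-pixel integer maps (semantic class ids and instance ids), which is a common but not universal format. The function name `dynamic_instance_masks`, the class ids in `DYNAMIC_CLASS_IDS`, and the convention that instance id 0 means "no instance" are hypothetical and would need to match the label map of whichever segmentation model is used.

```python
import numpy as np

# Hypothetical ids for dynamic categories (e.g. "person", "car" in an assumed label map).
DYNAMIC_CLASS_IDS = (11, 13)


def dynamic_instance_masks(class_map, instance_map, dynamic_ids=DYNAMIC_CLASS_IDS):
    """Return {instance_id: boolean mask} for instances of dynamic classes only.

    class_map:    (H, W) int array of per-pixel semantic class ids.
    instance_map: (H, W) int array of per-pixel instance ids (0 = no instance).
    """
    # Pixels belonging to any dynamic class; static objects and background drop out here.
    dynamic_pixels = np.isin(class_map, list(dynamic_ids))
    masks = {}
    for inst_id in np.unique(instance_map[dynamic_pixels]):
        if inst_id == 0:
            continue  # skip pixels that carry no instance label
        masks[int(inst_id)] = dynamic_pixels & (instance_map == inst_id)
    return masks


# Toy 3x3 example: two dynamic instances (ids 1 and 2) and one static instance (id 3).
class_map = np.array([[0, 13, 13],
                      [11, 0, 13],
                      [11, 0, 5]])
instance_map = np.array([[0, 1, 1],
                         [2, 0, 1],
                         [2, 0, 3]])
masks = dynamic_instance_masks(class_map, instance_map)
```

Here `masks` contains one boolean mask per dynamic instance, while the static instance with id 3 is discarded because its class is not in the dynamic set.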

Paper Structure

This paper contains 10 sections, 10 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: (a) Images reconstructed during TTA under the structure-from-motion (SfM) assumption, following Godard et al. (2019), with and without a pose estimation network. (b) TTA performance of the proposed method and competing methods on 7 MDE metrics, averaged over 4 different test cases on the DrivingStereo and Waymo datasets.
  • Figure 2: Overall architecture and schematic diagram of our TTA framework to adapt a pretrained MDE network given sequences of single RGB images from a monocular camera. Detailed computation procedures are presented in Algorithm 1 of Appendix C.