Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong

Abstract

We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images; instead, it leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs in both indoor and outdoor settings, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
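The recalibration step mentioned in the abstract (turning a perspective image into a fisheye one) can be illustrated with a minimal sketch. The abstract does not state which distortion model is used, so this sketch assumes an ideal equidistant fisheye (r = f·θ) and a pinhole source camera (r = f·tan θ) sharing one principal point. The function names, the nearest-neighbor sampling, and the nested-list image format are all hypothetical choices for illustration and are not taken from the released code.

```python
import math


def fisheye_to_perspective_pixel(u, v, cx, cy, f_fisheye, f_persp):
    """Return the perspective pixel that fisheye pixel (u, v) samples from.

    Assumption: equidistant fisheye (r = f * theta) and pinhole perspective
    (r = f * tan(theta)) with a shared principal point (cx, cy).
    Returns None for rays at or beyond 90 degrees from the optical axis,
    which a pinhole image cannot observe.
    """
    dx, dy = u - cx, v - cy
    r_fisheye = math.hypot(dx, dy)
    if r_fisheye == 0.0:
        return (cx, cy)  # the optical axis maps to itself
    theta = r_fisheye / f_fisheye  # angle of the incoming ray
    if theta >= math.pi / 2:
        return None
    r_persp = f_persp * math.tan(theta)  # radius of the same ray in the pinhole image
    scale = r_persp / r_fisheye
    return (cx + dx * scale, cy + dy * scale)


def recalibrate_to_fisheye(image, f_fisheye, f_persp, fill=0):
    """Inverse-warp a perspective image (nested lists) into a fisheye view.

    Uses nearest-neighbor sampling; pixels with no valid source keep `fill`.
    """
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    out = [[fill] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            src = fisheye_to_perspective_pixel(u, v, cx, cy, f_fisheye, f_persp)
            if src is None:
                continue
            x, y = round(src[0]), round(src[1])
            if 0 <= x < w and 0 <= y < h:
                out[v][u] = image[y][x]
    return out
```

In a training setup like the one the abstract describes, the depth estimated from the warped image would then be compared against the depth estimated from the original perspective image; that comparison is not shown here.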

Paper Structure

This paper contains 17 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Foundational monocular depth estimators fail on fisheye images. Despite being trained on large-scale datasets, foundational monocular depth estimators (FMDEs) produce erroneous outputs. The inaccurate, blurry estimates are caused by a covariate shift that stems from fisheye distortion.
  • Figure 2: Inference on different cameras. Calibration Tokens enable foundational monocular depth estimators to adapt to fisheye images while maintaining performance on perspective images.
  • Figure 3: Overview of our method. We introduce a set of trainable Calibration Tokens, which is appended to the input sequence of the fisheye image tokens. The Calibration Tokens are trained to adapt the model to produce accurate depth maps for images with various fisheye distortions. A unique fisheye calibration token is appended to the input of each new layer of the encoder.
  • Figure 4: Qualitative comparison on the ScanNet++ (indoor) (Yeshwanth et al., 2023) and KITTI-360 (Liao et al., 2022) datasets. Here, +C. T. denotes predictions obtained by appending Calibration Tokens to the patch embeddings of the model shown above. Calibration Tokens enable models to adapt to different fisheye cameras, especially in regions with large distortions.
  • Figure 5: t-SNE plot of fisheye and perspective embeddings. Fisheye embeddings become closer to those of perspective images after being modulated by Calibration Tokens.
  • ...and 6 more figures
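
The token mechanism described in the Figure 3 caption (layer-specific Calibration Tokens appended to the input sequence of each encoder layer) can be sketched at the level of sequence plumbing. This is a toy under stated assumptions, not the authors' implementation: `mix_layer` is a hypothetical stand-in for a frozen transformer layer, tokens are plain Python lists rather than learned parameters, and discarding each layer's calibration outputs before the next layer appends its own tokens is an assumption made here for illustration.

```python
def mix_layer(tokens):
    """Toy stand-in for a frozen encoder layer.

    Each token is averaged with the sequence mean, so appended tokens
    influence every patch token, loosely mimicking attention.
    """
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def encode(patch_tokens, layers, calibration_tokens=None):
    """Run patch tokens through frozen layers, optionally with calibration tokens.

    calibration_tokens[i] is the list of tokens appended before layer i.
    Passing None reproduces the unmodified encoder, e.g. for perspective inputs.
    """
    x = [list(t) for t in patch_tokens]
    num_patches = len(x)
    for i, layer in enumerate(layers):
        if calibration_tokens is not None:
            # Append this layer's own calibration tokens to the sequence.
            x = x + [list(t) for t in calibration_tokens[i]]
        x = layer(x)
        # Assumption: keep only the patch positions; the next layer
        # appends its own fresh calibration tokens.
        x = x[:num_patches]
    return x
```

Because the layers themselves are never modified, calling `encode` without `calibration_tokens` leaves the original behavior untouched, which matches the caption of Figure 2 in spirit.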