FoVA-Depth: Field-of-View Agnostic Depth Estimation for Cross-Dataset Generalization

Daniel Lichy, Hang Su, Abhishek Badki, Jan Kautz, Orazio Gallo

TL;DR

This work introduces FoVA-Depth, a depth estimation framework for Generalized Central Cameras (GCCs) that trains solely on small-FoV pinhole data yet generalizes to large-FoV imagery at test time. The core idea is Extrinsic Rotation Augmentation (ERA): inputs are warped to a canonical representation (ERP or cubemap) under varying rotations, so that the training images land in every region of that representation and the network learns to reason about the distortions found in wide-FoV images. ERA is paired with padding-aware convolution operators (CircConv and CubeConv) that maintain continuity across the boundaries of the canonical representations. The method demonstrates cross-dataset generalization in indoor (ScanNet to Matterport360) and outdoor (DDAD to KITTI-360) settings, outperforming or matching specialized baselines such as MODE and 360MVSNet. The approach is practical for automotive and real-estate applications, and it opens avenues for extending pinhole-trained depth models to arbitrary FoV via representation-specific processing and efficient sampling strategies such as Reciprocal Tangent Sampling.
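The warping step behind ERA can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`erp_rays`, `warp_pinhole_to_erp`, `rotation_y`), the axis convention (+z forward, +x right, +y down), nearest-neighbor sampling, and the restriction to a single image with a yaw-only rotation are all assumptions made for illustration. The sketch casts one ray per ERP pixel, rotates the rays by a rotation `R`, and looks up the pinhole pixel each ray hits, so that changing `R` moves the pinhole content to a different region of the ERP canvas.

```python
import numpy as np

def erp_rays(height, width):
    """Unit ray direction for every ERP pixel, shape (height, width, 3)."""
    # Longitude spans [-pi, pi), latitude spans (-pi/2, pi/2), at pixel centers.
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2.0
    lon, lat = np.meshgrid(lon, lat)
    # Assumed convention: +z forward, +x right, +y down.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def rotation_y(angle):
    """Rotation matrix about the vertical (y) axis; a yaw-only stand-in for ERA."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def warp_pinhole_to_erp(image, K, R, height, width):
    """Warp a pinhole image into an ERP canvas after rotating the ERP rays by R.

    Returns the ERP image and a boolean mask of ERP pixels covered by the input.
    """
    h, w = image.shape[:2]
    rays = erp_rays(height, width) @ R.T       # rays expressed in the pinhole frame
    z = rays[..., 2]
    in_front = z > 1e-6                        # only rays in front of the camera project
    z_safe = np.where(in_front, z, 1.0)        # avoid division by zero elsewhere
    u = K[0, 0] * rays[..., 0] / z_safe + K[0, 2]
    v = K[1, 1] * rays[..., 1] / z_safe + K[1, 2]
    ui = np.floor(u).astype(int)               # nearest-neighbor lookup (assumption)
    vi = np.floor(v).astype(int)
    valid = in_front & (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    erp = np.zeros((height, width) + image.shape[2:], dtype=image.dtype)
    erp[valid] = image[vi[valid], ui[valid]]
    return erp, valid

# Usage: a 90-degree-FoV pinhole image placed at the front and at the back of the ERP.
K = np.array([[32.0, 0.0, 32.0], [0.0, 32.0, 32.0], [0.0, 0.0, 1.0]])
image = np.full((64, 64), 7, dtype=np.uint8)
erp_front, valid_front = warp_pinhole_to_erp(image, K, np.eye(3), 64, 128)
erp_back, valid_back = warp_pinhole_to_erp(image, K, rotation_y(np.pi), 64, 128)
```

In a multi-view setting the same rotation would presumably be applied consistently to all views of a training sample; the sketch leaves that, and the choice of rotation distribution, open.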

Abstract

Wide field-of-view (FoV) cameras efficiently capture large portions of the scene, which makes them attractive in multiple domains, such as automotive and robotics. For such applications, estimating depth from multiple images is a critical task, and therefore, a large amount of ground truth (GT) data is available. Unfortunately, most of the GT data is for pinhole cameras, making it impossible to properly train depth estimation models for large-FoV cameras. We propose the first method to train a stereo depth estimation model on the widely available pinhole data, and to generalize it to data captured with larger FoVs. Our intuition is simple: We warp the training data to a canonical, large-FoV representation and augment it to allow a single network to reason about diverse types of distortions that otherwise would prevent generalization. We show strong generalization ability of our approach on both indoor and outdoor datasets, which was not possible with previous methods.

Paper Structure (51 sections, 10 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Our FoV-agnostic MVS model can be trained on small-FoV pinhole data and generalizes to images of various FoVs and camera models at inference time.
  • Figure 2: To estimate FoV-agnostic depth, we warp the inputs to a target representation (e.g., cubemap or ERP). We introduce Extrinsic Rotation Augmentations so that images are warped to all areas of this representation at training time (b). This forces the model trained on pinhole data to learn to reason about distortions in other types of images.
  • Figure 3: Our MVS pipeline. Here we only show the architecture for cubemap, but the same pipeline can be used for ERP by simply switching the convolution operations.
  • Figure 4: (a) We pad a side of the ERP by replicating pixel values from the opposite side. (b) We pad the green cube face with the interpolated value of the green point projected onto the orange face. Transparent squares indicate the convolution filters.
  • Figure 5: Generalization results of our approach on Matterport360 (top) and KITTI-360 (bottom). For indoor scenes, our approach trained with ERA outperforms the competing approaches MODE [Li_Jin2022MODE] and 360MVSNet [chiu360mvsnet] for both ERP and cubemap representations. For outdoor scenes, our approach generalizes better than MODE [Li_Jin2022MODE], which is trained only on large-FoV synthetic data. Our approach can naturally use additional views: our 3-view stereo shows better reconstructions (see the highlighted regions) for both ERP and cubemap representations.
  • ...and 5 more figures
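
The ERP padding described in the Figure 4(a) caption can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's CircConv operator: the function name `circular_pad_erp` is hypothetical, and zero padding at the top and bottom rows is an assumption, since the caption only describes the left/right wrap-around.

```python
import numpy as np

def circular_pad_erp(feat, pad):
    """Pad a (C, H, W) ERP feature map before a convolution.

    Left/right borders wrap around, copying values from the opposite side,
    because longitude is periodic. Top/bottom borders are zero-padded here,
    which is an assumption of this sketch.
    """
    feat = np.pad(feat, ((0, 0), (0, 0), (pad, pad)), mode="wrap")
    feat = np.pad(feat, ((0, 0), (pad, pad), (0, 0)), mode="constant")
    return feat

# Usage: a tiny 2-channel, 3x4 feature map padded by one pixel on each side.
feat = np.arange(2 * 3 * 4).reshape(2, 3, 4)
padded = circular_pad_erp(feat, 1)
```

After padding, an ordinary convolution with no further padding sees neighbors across the left/right seam of the ERP, which is the continuity the caption refers to.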