WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments

Joshua Knights, Joseph Reid, Kaushik Roy, David Hall, Mark Cox, Peyman Moghadam

TL;DR

This work proposes WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments, and conducts comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks.

Abstract

Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.

Paper Structure (21 sections, 5 equations, 6 figures, 9 tables)

This paper contains 21 sections, 5 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: The global maps of two sequences from WildCross. The left panels show the RGB images (top), annotated depth images (middle), and lidar submaps (bottom) at locations A1 and A2. These correspond to revisits of the same location from opposite directions across different sessions. WildCross presents a challenging new benchmark for cross-modal place recognition and metric depth estimation, with eight traversals covering diverse viewpoints in two large-scale forests.
  • Figure 2: WildCross overview: (a) RGB image, (b) depth image, (c) depth overlay, (d) surface normals, (e) lidar submap.
  • Figure 3: Impact of visibility estimation. (a) RGB image. (b) Naïve projection of global 3D points produces noisy depth maps that include occluded points. (c) Our visibility pipeline removes the occluded points, yielding higher-quality depth.
  • Figure 4: Depth distribution for WildCross vs. KITTI Annotated Depth (Uhrig et al., 2017). Violin plots are computed from 1% subsamples of both datasets. Width and height are normalized with respect to image sizes.
  • Figure 5: Cross-sequence visual place recognition (VPR) Recall@1 (R1) on WildCross. The reverse-revisit sequences V-02 and K-02 significantly degrade performance, underscoring the challenge of viewpoint diversity.
  • ...and 1 more figure
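
The R1 metric reported in Figure 5 is commonly computed as Recall@1: the fraction of queries whose single best-matching database entry was captured close to the query's true position. The sketch below illustrates that general idea only and is not the authors' evaluation code. The function name `recall_at_1`, the cosine-similarity retrieval, the 25 m threshold, and the assumption that every query has at least one true revisit in the database are all illustrative choices that the text above does not specify.

```python
import numpy as np


def recall_at_1(query_desc, db_desc, query_pos, db_pos, dist_thresh=25.0):
    """Fraction of queries whose top-1 retrieval lies within dist_thresh metres.

    query_desc: (Q, D) global descriptors for the query sequence
    db_desc:    (N, D) global descriptors for the database sequence
    query_pos:  (Q, 2) or (Q, 3) ground-truth query positions
    db_pos:     (N, 2) or (N, 3) ground-truth database positions
    dist_thresh is a hypothetical threshold, not a value taken from the paper.
    """
    query_desc = np.asarray(query_desc, dtype=float)
    db_desc = np.asarray(db_desc, dtype=float)
    query_pos = np.asarray(query_pos, dtype=float)
    db_pos = np.asarray(db_pos, dtype=float)

    # L2-normalise so that the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)

    # Index of the most similar database descriptor for each query.
    top1 = np.argmax(q @ d.T, axis=1)

    # A retrieval counts as correct if the retrieved place is near the query.
    position_error = np.linalg.norm(query_pos - db_pos[top1], axis=1)
    return float(np.mean(position_error <= dist_thresh))
```

In a cross-sequence setting such as the one in Figure 5, the query and database descriptors would come from two different traversals of the same environment. Descriptors that are not invariant to viewing direction tend to retrieve the wrong place on reverse revisits, which lowers this score.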