Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

Martin Kvisvik Larsen; Oscar Pizarro

Abstract

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
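As a rough illustration of the Recall@K metric mentioned in the abstract, the sketch below counts a query as a hit when at least one ground-truth match appears among its top-K retrieved database images. The function name `recall_at_k`, the input layout, and the choice to skip queries without any ground-truth match are assumptions made for this example, not the paper's evaluation protocol. Running it with two different ground-truth sets shows how a more permissive ground truth can raise the score for the same retrieval results.

```python
def recall_at_k(ranked_ids, positives, ks=(1, 5, 10)):
    """Fraction of queries with at least one ground-truth match in the top K.

    ranked_ids: one list of database ids per query, best match first.
    positives:  one set of ground-truth database ids per query.
    Queries with an empty positive set are skipped (an assumption of this sketch).
    """
    hits = {k: 0 for k in ks}
    n_valid = 0
    for ranking, pos in zip(ranked_ids, positives):
        if not pos:
            continue
        n_valid += 1
        for k in ks:
            if pos.intersection(ranking[:k]):
                hits[k] += 1
    if n_valid == 0:
        return {k: float("nan") for k in ks}
    return {k: hits[k] / n_valid for k in ks}


# Hypothetical retrieval results for two queries (ids are made up).
ranked = [[3, 1, 2], [0, 2, 1]]
footprint_gt = [{1}, {1}]      # stricter: only views sharing seafloor content
distance_gt = [{1, 3}, {1}]    # looser: everything within a distance threshold

strict = recall_at_k(ranked, footprint_gt, ks=(1, 2, 3))
loose = recall_at_k(ranked, distance_gt, ks=(1, 2, 3))
print(strict, loose)
```

In this toy case the looser ground truth accepts database image 3 for the first query, so Recall@1 increases even though the retrieved images are unchanged.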

Paper Structure (28 sections, 17 equations, 16 figures, 6 tables)

Introduction
Methods
Data Source and Dataset Overview
Image Color Correction
Geometric Reconstruction and Registration
Estimating Image Seafloor Footprints
Camera Model
Range Map Fusion
Image Footprint and Overlap Estimation
Evaluating Long-Term Visual Place Recognition
Selected Visual Place Recognition Models
Evaluation Protocol and Metrics
Results
Geometric Registration
Linking Camera Views With Overlapping Image Footprints
...and 13 more sections

Figures (16)

Figure 1: Overview map of the benthic reference sites in our dataset. (A) Map of Australia showing the geographic location of each site. (B-F) Orthomosaics showing seafloor patches with characteristics representative of the following benthic reference sites: (B) Site 1, (C) Site 2, (D) Site 3, (E) Site 4, and (F) Site 5. The orthomosaic patches are rendered at the same spatial scale, indicated by the scale bar in the lower right corner. Base map data © Commonwealth of Australia (Australian Bureau of Statistics) 2021, ASGS Edition 3 digital boundaries, used under CC BY 4.0 (Australian Bureau of Statistics, 2021).
Figure 2: (A) Raw (left) and color-corrected (right) image from the 2010 visit to Site 1. (B) Raw (left) and color-corrected (right) image from the 2013 visit to Site 1.
Figure 3: Single-line diagram of the geometric reconstruction and registration workflow for an example with two visits. In the example, Visit 1 is the registration target, while Visit 2 is the registration source that is registered to Visit 1. Note that for the target visit, the -estimated camera calibrations and interpolated camera poses are the final calibrations and camera poses, while for the source visit they are corrected by the similarity transformation estimated by the geometric registration.
Figure 4: An example showing the fusion of relative and metric stereo-derived range maps for an image from the 2009 visit to Site 4. (A) The color-corrected image, (B) the stereo range map reprojected into the pixel coordinate system of the original image, and (C) the range map estimated by fusing the relative range map with the stereo-derived range map. The stereo-derived and fused range maps are rendered with the same color map to indicate the range. Overall, the fused range maps show sharper edges and do not exhibit the geometric inconsistencies on the left side seen in the stereo range map, which are believed to be caused by limited left–right image overlap.
Figure 5: Illustration of a simplified 2D model of our footprint estimation method for two scenarios. Panel (A) shows a scenario where local terrain relief causes non-overlapping image footprints for two spatially close camera views. Panel (B) shows a scenario where large altitude differences cause spatially distant camera views to have overlapping image footprints.
...and 11 more figures
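The two scenarios described for Figure 5 can be mimicked with a much cruder 2D toy model than the paper's range-map-based footprints. The sketch below assumes a nadir-looking camera over locally flat seafloor, so each footprint is an interval whose half-width is altitude times the tangent of the half field of view. The function names, the 20 degree half field of view, and the 2 m distance threshold are all hypothetical values chosen for illustration.

```python
import math


def footprint_interval(x, altitude, half_fov_deg):
    """1D seafloor interval seen by a nadir camera at horizontal position x.

    Assumes locally flat seafloor below the camera (a simplification).
    """
    half_width = altitude * math.tan(math.radians(half_fov_deg))
    return (x - half_width, x + half_width)


def overlap_length(a, b):
    """Length of the intersection of two intervals, 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def compare_links(x1, alt1, x2, alt2, half_fov_deg=20.0, dist_threshold=2.0):
    """Contrast a horizontal distance-threshold link with a footprint-overlap link."""
    f1 = footprint_interval(x1, alt1, half_fov_deg)
    f2 = footprint_interval(x2, alt2, half_fov_deg)
    return {
        "distance_link": abs(x1 - x2) <= dist_threshold,
        "footprint_link": overlap_length(f1, f2) > 0.0,
    }


# Scenario like panel (A): close views, but raised terrain leaves little altitude.
scenario_a = compare_links(x1=0.0, alt1=0.5, x2=1.5, alt2=0.5)
# Scenario like panel (B): distant views, one of them at a much higher altitude.
scenario_b = compare_links(x1=0.0, alt1=2.0, x2=3.0, alt2=8.0)
print(scenario_a, scenario_b)
```

In the first case the distance threshold links two views that share no seafloor content, and in the second it misses two views that do, which is the mismatch a footprint-based ground truth is meant to avoid.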
