Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments

Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan

TL;DR

This work tackles long-term underwater monitoring by introducing a hierarchical Visual Place Recognition pipeline that fuses global image retrieval with local feature refinement, followed by homography-based registration and 2D warping of segmentation masks to enable pixel-level change analysis. It introduces the SQUIDLE+ VPR Benchmark, the first large-scale underwater VPR dataset drawn from publicly available SQUIDLE+ data, to evaluate cross-time localization across diverse trajectories and environmental conditions. The approach demonstrates substantial speedups over brute-force methods while maintaining competitive accuracy and provides qualitative and quantitative insights into change detection via segmentation warping. The dataset and method offer a scalable, centimeter-level registration capability essential for monitoring dynamic marine ecosystems and informing conservation efforts.

Abstract

Effective monitoring of underwater ecosystems is crucial for tracking environmental changes, guiding conservation efforts, and ensuring long-term ecosystem health. However, automating underwater ecosystem management with robotic platforms remains challenging due to the complexities of underwater imagery, which pose significant difficulties for traditional visual localization methods. We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. Furthermore, we introduce the SQUIDLE+ VPR Benchmark, the first large-scale underwater VPR benchmark designed to leverage an extensive collection of unstructured data from multiple robotic platforms, spanning time intervals from days to years. The dataset encompasses diverse trajectories, arbitrary overlap, and diverse seafloor types captured under varying environmental conditions, including differences in depth, lighting, and turbidity. Our code is available at: https://github.com/bev-gorry/underloc
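The two-stage retrieval idea described above (global image retrieval followed by local feature refinement) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hierarchical_match`, the `top_k` parameter, and the `count_local_matches` callback are hypothetical, and the callback stands in for whatever local matcher is used (the paper mentions SuperPoint and LightGlue). Global descriptors are assumed to be plain vectors compared by cosine similarity.

```python
import numpy as np


def hierarchical_match(query_desc, db_descs, count_local_matches, top_k=5):
    """Return (best_db_index, shortlisted_indices) for a single query image.

    query_desc          : 1-D global descriptor of the query image.
    db_descs            : 2-D array with one global descriptor per database image.
    count_local_matches : hypothetical callback; given a database index, it
                          returns the number of local feature matches between
                          the query and that database image.
    """
    # Stage 1: global retrieval by cosine similarity, keeping only the top_k.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    similarities = db @ q
    candidates = np.argsort(-similarities)[:top_k]

    # Stage 2: re-rank the shortlist with the (expensive) local matcher.
    scores = [count_local_matches(int(i)) for i in candidates]
    best = int(candidates[int(np.argmax(scores))])
    return best, candidates.tolist()


if __name__ == "__main__":
    # Toy data: four database descriptors and made-up local match counts.
    database = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
    fake_counts = {0: 12, 1: 80, 2: 500, 3: 0}
    best, shortlist = hierarchical_match(
        np.array([1.0, 0.0]), database, lambda i: fake_counts[i], top_k=2
    )
    print(best, shortlist)
```

In the toy example the local matcher runs only on the two shortlisted images, so index 2 is never evaluated despite its high made-up match count. That is where the saving over brute-force matching of every image pair comes from, at the cost of missing matches that the global stage fails to shortlist.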

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our image-based relocalization and alignment method for underwater ecosystem monitoring. We use hierarchical Visual Place Recognition (VPR) techniques to robustly identify common locations from images captured across multi-year timescales using video from freely navigating robots. These images exhibit significant variations in environmental conditions, such as lighting, turbidity, and depth. We establish correspondences between image keypoints to estimate a rigid transformation between the images, which we then use to register segmentation masks in a common pixel space. Finally, we apply an intersection over union (IoU) metric to detect ecosystem changes over time.
  • Figure 2: GPS trajectories (in meters) and environmental differences across underwater datasets. The visualized trajectories depict query sequences (orange) and database sequences (blue) for four underwater datasets: (a) the Eiffel Tower (2018–2020) from the Mid-Atlantic Ridge hydrothermal vent, (b) Okinawa (2016–2017–2018), capturing mesophotic coral reef environments before and after Typhoon Trami, (c) the Tasman Fracture (2018), showcasing deep-sea benthic habitats, and (d) St Helens (2011–2013), featuring images recorded during a transition movement in shallow barren zones. To analyze the unique overlap between sequences covering the same area but following different trajectories, we include a third trajectory in the Okinawa dataset (green). Accompanying RGB image pairs for each trajectory illustrate corresponding locations, highlighting the significant appearance variations that challenge automated VPR systems. Notably, even correctly matched locations exhibit substantial visual differences due to variations in viewpoint, lighting conditions, and actual ecosystem changes over time, further emphasizing the complexity of long-term visual place recognition in underwater environments.
  • Figure 3: Recall@K performance for VPR methods across underwater datasets. Experimental results illustrating the probability that a correct match appears within the top $K$ retrieved candidates. We compare six VPR methods (MixVPR, CosPlace, NetVLAD, AnyLoc, MegaLoc, and CricaVPR) against a random guesser to assess whether the retrieval results, given our GPS-based ground truth, correspond to meaningful location identification. Additionally, we include the brute-force SuperPoint approach (which performs feature matching on all possible image pairs) as an alternative that directly compares local features without global retrieval. In the Okinawa dataset, the random guesser’s performance approaches that of VPR methods for $K \gtrsim 10$ due to the lawnmower trajectory pattern (see Fig. 2), which increases image density. This allows random selection to occasionally retrieve correct matches, even without visual correspondence. While this does not compromise the GPS-based ground truth, it underscores that VPR methods are most discriminative at lower $K$ values (see the subsection on hierarchical VPR). Due to its significantly higher computational cost, we do not include the SuperPoint brute-force approach on the St Helens dataset.
  • Figure 4: Top row: Precision-Recall curves for Best-Single-Match VPR. Our hierarchical SuperPoint-H method outperforms one-stage VPR approaches, achieving performance closer to the SuperPoint brute-force approach while significantly reducing computational cost (see the table of computation times). Bottom row: The background represents the binary ground truth in black and white, with green crosses indicating true positives and red crosses indicating false positives for the best match per query using our SuperPoint-H method. Positive matches are filtered to exclude those with a reprojection error greater than 10 pixels, corresponding to a precision of 39% for the Eiffel Tower dataset, 99% for Okinawa, 22% for Tasman Fracture, and 72% for St Helens.
  • Figure 5: Qualitative results of warping segmentation masks and reprojection error, accompanied by intersection over union (IoU) scores for aligned masks. Leftmost column: Query RGB images from each SQUIDLE+ dataset overlaid with SAM2 segmentation masks. Second column: Database match obtained using our MegaLoc + SuperPoint hierarchical method, overlaid with SAM2 segmentation masks. Third column: Query images warped onto the selected database image using the homography estimated from LightGlue keypoints. Actual keypoints from the database image are plotted as blue circles, while projected keypoints from the query image are shown as orange crosses. The reprojection distance is marked with a yellow dashed line. Rightmost column: Query masks warped using the estimated homography and overlaid onto the database masks. Areas of intersection are colored in green, while non-overlapping query and database masks remain orange and blue, respectively.
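The Recall@K metric in the Figure 3 caption (the fraction of queries for which a correct match appears among the top $K$ retrieved candidates) can be computed along the following lines. This is a sketch under stated assumptions, not the benchmark's evaluation code: the function name `recall_at_k` is hypothetical, the ground-truth positives are assumed to be precomputed (for example from GPS proximity), and queries with no positive are skipped.

```python
def recall_at_k(ranked_indices, positives, ks=(1, 5, 10)):
    """Compute Recall@K for a set of queries.

    ranked_indices : list of lists; database indices per query, best first.
    positives      : list of sets; ground-truth correct database indices per
                     query (assumed precomputed, e.g. from GPS distance).
    ks             : the K values to evaluate.
    """
    # Assumption: queries without any ground-truth positive are not scored.
    valid = [(r, p) for r, p in zip(ranked_indices, positives) if p]
    results = {}
    for k in ks:
        # A query is a hit if any of its top-k candidates is a positive.
        hits = sum(1 for r, p in valid if p.intersection(r[:k]))
        results[k] = hits / len(valid) if valid else 0.0
    return results


if __name__ == "__main__":
    ranked = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
    truth = [{1}, {1}, set()]  # the third query has no positive and is skipped
    print(recall_at_k(ranked, truth, ks=(1, 2, 3)))
```

On this toy input the recall rises from 0.0 at K=1 to 0.5 at K=2 and 1.0 at K=3. A random-guess baseline like the one in the caption can be scored with the same function by passing shuffled rankings.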
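The alignment step in the Figure 1, 4, and 5 captions (warping a query mask with an estimated homography, measuring reprojection error, and scoring overlap with IoU) can be sketched in plain NumPy. The homography is assumed to be given already. In the paper it is estimated from LightGlue keypoints and the masks come from SAM2, neither of which is reproduced here. All function names are hypothetical, and the mean point distance used as the reprojection error is an assumption about how that quantity is defined.

```python
import numpy as np


def apply_homography(H, pts):
    """Map an (N, 2) array of (x, y) points through a 3x3 homography."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def reprojection_error(H, query_pts, db_pts):
    """Mean pixel distance between projected query keypoints and database keypoints."""
    diff = apply_homography(H, query_pts) - db_pts
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def warp_mask(mask, H, out_shape):
    """Warp a boolean query mask into the database frame.

    Uses inverse mapping with nearest-neighbour sampling: each output pixel
    looks up the query pixel that the homography would send to it.
    """
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src = apply_homography(np.linalg.inv(H), dst)
    sx = np.rint(src[:, 0]).astype(int)
    sy = np.rint(src[:, 1]).astype(int)
    inside = (sx >= 0) & (sx < mask.shape[1]) & (sy >= 0) & (sy < mask.shape[0])
    out = np.zeros(h * w, dtype=bool)
    out[inside] = mask[sy[inside], sx[inside]]
    return out.reshape(h, w)


def mask_iou(a, b):
    """Intersection over union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


if __name__ == "__main__":
    # Toy homography: a pure 2-pixel shift along x.
    H = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    query_mask = np.zeros((8, 8), dtype=bool)
    query_mask[2:6, 1:5] = True
    db_mask = np.zeros((8, 8), dtype=bool)
    db_mask[2:6, 4:8] = True
    warped = warp_mask(query_mask, H, db_mask.shape)
    print(mask_iou(warped, db_mask))
    err = reprojection_error(H, np.array([[1.0, 2.0], [4.0, 5.0]]),
                             np.array([[3.0, 2.0], [6.0, 6.0]]))
    print(err, err <= 10.0)  # 10 px is the filter threshold from the Figure 4 caption
```

A low IoU between the warped query mask and the database mask is only meaningful as a change signal when the reprojection error is small, which is why the captions report both quantities together.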