Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change

Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan

Abstract

Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
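
The pair-selection idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a precomputed VPR distance matrix in which smaller values mean more similar images, and it keeps the `top_k` nearest cross-session neighbours per image. The function name `build_match_pairs`, the parameter `top_k`, and the toy distances are hypothetical; the abstract does not state how many candidates are kept or how they are thresholded.

```python
from itertools import combinations


def build_match_pairs(session_ids, vpr_distance, top_k=2):
    """Split image pairs into intra-session and cross-session match lists.

    session_ids[i]     -- session label of image i (e.g. survey year)
    vpr_distance[i][j] -- VPR descriptor distance (assumed: smaller = more similar)
    top_k              -- hypothetical number of cross-session candidates per image
    Returns two sets of index pairs (i, j) with i < j.
    """
    n = len(session_ids)

    # Intra-session: every pair within the same survey (exhaustive matching).
    intra = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if session_ids[i] == session_ids[j]
    }

    # Cross-session: only the top_k most similar images from other sessions.
    cross = set()
    for i in range(n):
        candidates = [j for j in range(n) if session_ids[j] != session_ids[i]]
        candidates.sort(key=lambda j: vpr_distance[i][j])
        for j in candidates[:top_k]:
            cross.add((min(i, j), max(i, j)))

    return intra, cross


# Toy example: two images per session, made-up distances.
sessions = ["2016", "2016", "2017", "2017"]
dist = [
    [0.0, 0.3, 0.2, 0.9],
    [0.3, 0.0, 0.8, 0.1],
    [0.2, 0.8, 0.0, 0.4],
    [0.9, 0.1, 0.4, 0.0],
]
intra_pairs, cross_pairs = build_match_pairs(sessions, dist, top_k=1)
print(sorted(intra_pairs))
print(sorted(cross_pairs))
```

In a setup like this, the cheaper handcrafted matcher would run on `intra_pairs` and the learned matcher only on `cross_pairs`, so the cost of learned matching grows roughly with `top_k` times the number of images rather than with the number of all cross-session pairs.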

Paper Structure (6 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Long-term multi-session 3D reconstruction for coral reef monitoring. Our method generates a single coherent 3D reconstruction from video-derived images of a coral reef captured over a three-year period. The figure presents the combined 3D point cloud alongside the fields of view of three images acquired from the same location in each year. Despite substantial temporal changes, images of the same area remain aligned with pixel-level accuracy within a shared coordinate frame.
  • Figure 2: Overview of the proposed reconstruction pipeline. RGB images acquired from multiple visits to the same area are provided to a feature extraction and matching module, together with a distance matrix generated from visual place recognition. Images within the same survey are matched exhaustively using fast handcrafted features to ensure robust intra-session reconstruction. For images across different visits, candidate cross-session pairs are first identified using the VPR-based distance matrix. A learned feature matcher is then applied selectively to these candidates to establish reliable cross-session correspondences, which are enforced directly during joint Structure-from-Motion optimization.
  • Figure 3: Multi-year AUV surveys at Sesoko Island (2016–2018). Top: approximate GPS trajectories of repeated lawnmower-pattern surveys covering the same reef area across multiple years. Bottom: example images captured at the same location before and after Typhoon Trami (Paeng), illustrating substantial appearance variation and structural disruption. These changes severely challenge cross-session correspondence establishment and long-term reconstruction.
  • Figure 4: Examples of selected image pairs used for cross-session evaluation. The shown pairs are spatially distributed across the survey area and drawn from different years. These image pairs form the basis for manual annotation of cross-session correspondences used in the reprojection error evaluation.
  • Figure 5: Qualitative alignment of 3D points across two visits. Projections of reconstructed 3D points from two survey sessions (SSK16 in blue, SSK17 in yellow) overlaid on the images (left) and visualized in 3D (right). The top row shows the input images, manually annotated cross-session correspondences, and a warped cross-session image to indicate the expected field-of-view overlap. In the subsequent rows, yellow circles denote annotated correspondences, blue crosses indicate projected points, and red lines represent reprojection error; corresponding 3D projections are shown on the right. While COLMAP + ICP yields point clouds that appear coarsely aligned in 3D, substantial pixel-level misalignment remains. COLMAP + BUFFER-X improves global overlap but exhibits residual local inconsistencies. Our method achieves close agreement with the expected overlap, indicating accurate cross-session alignment.
  • ...and 4 more figures
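
The reprojection error shown in Figure 5 (red lines between annotated correspondences and projected points) can be illustrated with a short sketch. The exact evaluation protocol is not given in this excerpt, so the following makes several assumptions: an undistorted pinhole camera, a world-to-camera pose `(R, t)`, and an error defined as the pixel distance between a projected 3D point from one session and the manually annotated pixel in an image from another session. The function names and the numbers in the example are hypothetical.

```python
import math


def project(point_w, R, t, f, cx, cy):
    """Project a 3D world point into pixel coordinates (assumed pinhole model).

    R is a 3x3 rotation (list of rows) and t a 3-vector, mapping world
    coordinates to camera coordinates: X_c = R * X_w + t.
    """
    xc = [sum(R[r][c] * point_w[c] for c in range(3)) + t[r] for r in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)


def reprojection_errors(points_w, annotations, R, t, f, cx, cy):
    """Pixel distance between each projected point and its annotated location."""
    return [
        math.dist(project(p, R, t, f, cx, cy), a)
        for p, a in zip(points_w, annotations)
    ]


# Toy example: identity pose, focal length 100 px, principal point (50, 50).
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]       # 3D points from session A
annotated = [(53.0, 54.0), (100.0, 50.0)]         # clicked pixels in session B image
errors = reprojection_errors(points, annotated, R, t, f=100.0, cx=50.0, cy=50.0)
print(errors)
```

A measure of this kind is sensitive to the pixel-level misalignment described in the Figure 5 caption, where point clouds can look coarsely aligned in 3D while their projections still land far from the annotated correspondences.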