ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue

Thomas Pritchard, Saifullah Ijaz, Ronald Clark, Basaran Bahadir Kocer

TL;DR

ForestVO proposes an end-to-end deep-learning visual odometry pipeline tailored for forest environments by combining ForestGlue, a forest-domain–adapted feature extractor and matcher built on SuperPoint with LightGlue/SuperGlue, and a transformer-based pose estimator. By retraining matchers on synthetic forest data and using a forest-specific training loss, the system achieves robust feature correspondence with as few as $512$ keypoints, yielding a LO-RANSAC AUC of $0.745$ at a $10^\circ$ threshold while enabling real-time operation on resource-constrained hardware. The pose-estimation model, trained on forest sequences, delivers an average relative pose error of $1.09$ m and a kitti_score of $2.33\%$ on challenging TartanAir sequences, outperforming direct methods like DSO in dynamic scenes and remaining competitive with TartanVO despite using only $10\%$ of the dataset. This work demonstrates a practical, end-to-end VO framework for forests, highlighting the value of domain adaptation and lightweight architectures for autonomous navigation in unstructured natural environments.

Abstract

Recent advancements in visual odometry systems have improved autonomous navigation; however, challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise feature correspondence accuracy. To address these challenges, we introduce ForestGlue, enhancing the SuperPoint feature detector through four configurations - grayscale, RGB, RGB-D, and stereo-vision - optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, retrained with synthetic forest data. ForestGlue achieves comparable pose estimation accuracy to baseline models but requires only 512 keypoints - just 25% of the baseline's 2048 - to reach an LO-RANSAC AUC score of 0.745 at a 10° threshold. With only a quarter of keypoints needed, ForestGlue significantly reduces computational overhead, demonstrating effectiveness in dynamic forest environments, and making it suitable for real-time deployment on resource-constrained platforms. By combining ForestGlue with a transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using matched 2D pixel coordinates between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming direct-based methods like DSO by 40% in dynamic scenes. Despite using only 10% of the dataset for training, ForestVO maintains competitive performance with TartanVO while being a significantly lighter model. This work establishes an end-to-end deep learning pipeline specifically tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation, thereby enhancing the accuracy and robustness of autonomous navigation systems.
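
The abstract reports an AUC score at a 10° threshold. As a rough illustration of what such a number measures, the sketch below computes the area under the recall-versus-error curve up to a threshold, normalised by that threshold, which is the definition commonly used in SuperGlue-style evaluations. It assumes each image pair has already been reduced to a single angular pose error in degrees; the function name `pose_auc` and the sample errors are hypothetical, and the paper's own evaluation script may differ.

```python
import numpy as np


def pose_auc(errors_deg, threshold_deg):
    """Area under the recall-vs-error curve up to threshold_deg, scaled to [0, 1].

    errors_deg: one angular pose error per image pair, in degrees (assumed input).
    """
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)

    # Start the curve at (error = 0, recall = 0).
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))

    # Keep the part of the curve below the threshold and close it at the threshold.
    last = np.searchsorted(errors, threshold_deg)
    e = np.concatenate((errors[:last], [threshold_deg]))
    r = np.concatenate((recall[:last], [recall[last - 1]]))

    # Trapezoidal integration, written out to avoid NumPy-version differences.
    area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
    return float(area / threshold_deg)


# Hypothetical errors for five image pairs, not data from the paper.
sample_errors = [1.2, 3.5, 4.0, 8.9, 25.0]
auc_at_10 = pose_auc(sample_errors, 10.0)
```

A score of 1.0 means every pair has zero error, and a score of 0.0 means no pair falls below the threshold, so the metric rewards both how many pairs succeed and how small their errors are.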

Paper Structure

This paper contains 20 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: ForestVO System Architecture: ForestVO consists of ForestGlue and a deep-learning pose estimation model. Image pairs are passed through ForestGlue which performs feature detection using multi-modal SuperPoint and feature matching using either SuperGlue or LightGlue, which have been refined on forest-specific datasets. The matched 2D keypoint coordinates between sequential frames are passed into the pose transformer model, which estimates the relative camera pose by predicting the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{T}$. The relative poses are concatenated to generate a predicted trajectory for the sequence of input images.
  • Figure 2: TartanAir Pre-trained Model Precision: Deep learning approaches outperformed traditional methods.
  • Figure 3: FinnForest Pre-trained Models' Precision: SuperGlue and LightGlue showed comparable performance to other learned methods with a decreased computational overhead.
  • Figure 4: TartanAir Multi-Modal LightGlue Relative Pose Error: The RGB and grayscale models showed similar relative pose error, while the RGB-D and stereo models showed a significant decrease across all thresholds. Colour key: Gray (blue), RGB (green), RGB-D (yellow), Stereo (red).
  • Figure 5: Estimated and Ground Truth Trajectories: A visualisation of the seasonsforest_sample_P002 sequence in the TartanAir dataset, comparing the trajectory estimated by our system with the corresponding ground truth. The sequence contains 301 poses and is approximately 50 m in length.
  • ...and 1 more figure
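
The Figure 1 caption notes that per-frame relative poses are concatenated into a predicted trajectory. A minimal sketch of that chaining step is given here, assuming each relative pose $(\mathbf{R}, \mathbf{T})$ expresses frame $k+1$ in the coordinates of frame $k$, so that absolute poses accumulate by right-multiplication. The convention and the helper names `to_homogeneous` and `chain_relative_poses` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T


def chain_relative_poses(relative_poses):
    """Accumulate relative poses into absolute poses, starting at the identity.

    Assumed convention: pose_{k+1} = pose_k @ T_rel, where T_rel is the pose of
    frame k+1 expressed in frame k. Returns an array of shape (N + 1, 4, 4).
    """
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for R, t in relative_poses:
        pose = pose @ to_homogeneous(R, t)
        trajectory.append(pose.copy())
    return np.stack(trajectory)


# Hypothetical input: four identical steps, each a 90 degree yaw combined with
# a 1 m translation along x. The path traces a square and returns to the start.
yaw_90 = np.array([[0.0, -1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
steps = [(yaw_90, np.array([1.0, 0.0, 0.0]))] * 4
trajectory = chain_relative_poses(steps)
positions = trajectory[:, :3, 3]  # camera positions along the path
```

Because every absolute pose is a product of all earlier relative poses, a small error in any single estimate propagates to the rest of the trajectory, which is why per-frame accuracy matters for sequence-level metrics.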