Reinforcement Learning Meets Visual Odometry

Nico Messikommer, Giovanni Cioffi, Mathias Gehrig, Davide Scaramuzza

TL;DR

This work addresses the brittleness and hyperparameter-tuning burden of Visual Odometry (VO) by reframing VO as a sequential decision problem. It introduces a neural agent with a Perceiver-based Variable Encoder that adaptively makes keyframe decisions and selects keypoint grid sizes based on the real-time VO state, trained with Proximal Policy Optimization (PPO) and a privileged critic. The reward combines the pose alignment error in a sliding window with non-differentiable metrics such as runtime, enabling online trade-offs. Across EuRoC, TUM-RGBD, and KITTI, RL-enhanced VO achieves up to 19% improvement in Absolute Trajectory Error (ATE) and greater robustness, demonstrating generalization across VO backbones (e.g., SVO, DSO) and reducing the need for extensive offline tuning.

Abstract

Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics.
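
The decision step described in the abstract can be illustrated with a minimal sketch. It assumes the agent outputs two independent sets of logits (one for the binary keyframe action, one for the grid size) and that the reward is a weighted sum of position error and runtime. The names `GRID_SIZES`, `sample_action`, `reward`, and `runtime_weight`, the candidate grid sizes, and the linear form of the reward are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical candidate grid sizes in pixels; the actual values are not given here.
GRID_SIZES = (15, 25, 35, 45)


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sample_action(keyframe_logits, grid_logits, rng):
    """Sample one multi-discrete action: (is_keyframe, grid_size, log_prob)."""
    p_kf = softmax(keyframe_logits)   # two entries: [no keyframe, keyframe]
    p_grid = softmax(grid_logits)     # one entry per candidate grid size
    kf = int(rng.choice(2, p=p_kf))
    grid_idx = int(rng.choice(len(GRID_SIZES), p=p_grid))
    # The two heads are treated as independent, so their log-probabilities add.
    log_prob = float(np.log(p_kf[kf]) + np.log(p_grid[grid_idx]))
    return bool(kf), GRID_SIZES[grid_idx], log_prob


def reward(position_error, runtime_s, runtime_weight=0.1):
    """Assumed reward: negative position error minus a weighted runtime penalty."""
    return -position_error - runtime_weight * runtime_s


rng = np.random.default_rng(0)
is_keyframe, grid_size, log_prob = sample_action([0.0, 0.0], [0.0, 0.0, 0.0, 0.0], rng)
```

Because runtime enters only through the scalar reward, it does not need to be differentiable, which is one reason a policy-gradient method such as PPO fits this setting.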

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Our Framework. We propose to employ a learned agent to adaptively guide a VO method using real-time observations for enhanced robustness and accuracy. By considering the problem as a sequential decision process, we use RL to train the agent primarily based on the position error computed within a sliding window (dashed lines).
  • Figure 2: Agent Overview. Our agent takes as input the map statistics computed by the VO method at timestep $t_{i-1}$ and a variable number of keypoints, which it processes with a multi-head attention layer. Given these inputs, a two-layer MLP computes a multi-discrete probability distribution over the binary keyframe action and the grid size action. These actions are then used inside the VO method to process the current frame at timestep $t_i$, which leads to the observation for the next timestep $t_{i+1}$.
  • Figure 3: Position Error. To closely relate the pose prediction accuracy to the current action, we employ a sliding window of five timesteps to align the ground truth and estimated trajectory using a scale factor $s$, a translation vector $t$, and a rotation $R$. The error at the current timestep $t_i$ is then used as the negative position reward.
  • Figure 4: Keyframe Selection. The number of keyframes selected by SVO at different translational and angular velocities on the EuRoC dataset.
  • Figure 5: Top-Down View of Trajectories. Visualization of the predicted trajectory using SVO with and without RL.
  • ...and 2 more figures
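
The sliding-window position error described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes a closed-form least-squares similarity alignment (the Umeyama method) is used to find the scale $s$, rotation $R$, and translation $t$; the caption does not state which solver the authors use, and the function names `umeyama_alignment` and `position_reward` are hypothetical.

```python
import numpy as np


def umeyama_alignment(est, gt):
    """Least-squares similarity transform (s, R, t) such that gt ~ s * R @ est + t.

    est, gt: arrays of shape (N, 3) holding corresponding positions.
    Assumes the estimated positions are not all identical (non-zero variance).
    """
    n = est.shape[0]
    mu_est, mu_gt = est.mean(axis=0), gt.mean(axis=0)
    est_c, gt_c = est - mu_est, gt - mu_gt
    cov = gt_c.T @ est_c / n                  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid returning a reflection
    R = U @ S @ Vt
    var_est = (est_c ** 2).sum() / n
    s = float(np.trace(np.diag(D) @ S) / var_est)
    t = mu_gt - s * R @ mu_est
    return s, R, t


def position_reward(est_window, gt_window):
    """Negative position error at the newest timestep after window alignment."""
    s, R, t = umeyama_alignment(est_window, gt_window)
    aligned_last = s * R @ est_window[-1] + t
    return -float(np.linalg.norm(aligned_last - gt_window[-1]))


# Usage: a five-timestep window, as in the caption, with synthetic positions.
rng = np.random.default_rng(0)
gt_window = rng.normal(size=(5, 3))
est_window = gt_window + 0.01 * rng.normal(size=(5, 3))
r = position_reward(est_window, gt_window)
```

Aligning only a short window, rather than the whole trajectory, keeps the error tied to recent actions, which matches the motivation given in the caption.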