Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends

Simone Nascivera, Leonard Bauersfeld, Jeff Delaune, Davide Scaramuzza

Abstract

Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.

Paper Structure (19 sections, 5 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Our reinforcement learning (RL) agent dynamically selects feature detection and tracking parameters based on the current image. By conditioning decisions on visual appearance, the policy adapts the frontend to scene characteristics such as texture density, illumination, motion blur, and sensor noise. The policy is trained to minimize feature drift while maximizing spatial coverage and computational efficiency. Training is performed in simulation using TartanAirV2, while evaluation is conducted on unseen synthetic sequences and real-world sequences from TUM RGB-D, demonstrating strong sim-to-real generalization.
  • Figure 2: Since TartanAirV2 does not feature realistic noise and blur dynamics, we augment the dataset. For evaluation, we separately consider the nominal and the augmented versions.
  • Figure 3: Comparison of our method with static-parameter baselines on synthetic data. We train RL (ours) on a training set of 40 sequences and plot the evaluation results on 5 unseen test sequences. The "PSO opt. on training set" baseline is optimized on the same training data as RL and then evaluated on the test set. "PSO opt. on test set" is a deliberately unfair comparison: because it is optimized directly on the test set, it represents the upper bound achievable by any static parameters. Especially in the challenging yet realistic Blur + Noise setting, our method clearly outperforms the training-set baseline and approaches the performance of the static parameters optimized on the test set.
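
To make the static-parameter baselines in Figure 3 concrete, the following is a minimal sketch of how a single fixed parameter set could be found, assuming "PSO" denotes particle swarm optimization. Everything specific here is hypothetical: the function names, the two-dimensional parameter vector, the swarm settings, and above all `toy_cost`, which merely stands in for whatever frontend objective the paper evaluates over a set of sequences. It is not the authors' implementation.

```python
import random


def pso_minimize(cost, bounds, n_particles=20, n_iters=60,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize cost(params) over box bounds with a basic particle swarm.

    w, c1, c2 are the usual inertia, cognitive, and social weights;
    the values are common textbook defaults, not taken from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocity.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Per-particle best and swarm-wide best seen so far.
    best_pos = [p[:] for p in pos]
    best_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp so parameters stay in their valid range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < best_cost[i]:
                best_cost[i], best_pos[i] = c, pos[i][:]
                if c < g_cost:
                    g_cost, g_pos = c, pos[i][:]
    return g_pos, g_cost


def toy_cost(params):
    """Hypothetical stand-in for a frontend objective averaged over sequences.

    A real objective would run the VO frontend with these static parameters;
    here it is a quadratic bowl with its minimum at (0.3, 0.6).
    """
    target = (0.3, 0.6)
    return sum((p - t) ** 2 for p, t in zip(params, target))


# Two normalized, hypothetical frontend parameters, each in [0, 1].
best_params, best_cost = pso_minimize(toy_cost, [(0.0, 1.0), (0.0, 1.0)])
```

Under this reading, the two baselines would differ only in which sequences the cost is evaluated on: the training set for the fair baseline, the test set for the upper bound. Either way the result is one parameter vector held fixed for every frame, whereas the RL policy selects parameters per image.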