Deep Visual Odometry with Events and Frames

Roberto Pellerito, Marco Cannici, Daniel Gehrig, Joris Belhadj, Olivier Dubois-Matra, Massimo Casasco, Davide Scaramuzza

TL;DR

Despite being trained only in simulation, RAMP-VO outperforms previous methods on the newly introduced Apollo and Malapert datasets, and on existing benchmarks, where it improves over image- and event-based methods by 58.8% and 30.6%, respectively, paving the way for robust and asynchronous VO in space.

Abstract

Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. To improve robustness, recent model-based VO systems have begun combining standard and event-based cameras. While event cameras excel in low-light conditions and under high-speed motion, standard cameras provide dense and easier-to-track features. However, the field of image- and event-based VO still predominantly relies on model-based methods and has yet to fully integrate recent image-only advancements leveraging end-to-end learning-based architectures. Seamlessly integrating the two modalities remains challenging due to their different nature, one asynchronous and the other not, which limits the potential for more effective image- and event-based VO. We introduce RAMP-VO, the first end-to-end learned image- and event-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders capable of fusing asynchronous events with image data, providing 8x faster inference and 33% more accurate predictions than existing solutions. Despite being trained only in simulation, RAMP-VO outperforms previous methods on the newly introduced Apollo and Malapert datasets, and on existing benchmarks, where it improves over image- and event-based methods by 58.8% and 30.6%, respectively, paving the way for robust and asynchronous VO in space.

Paper Structure (10 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Recurrent, Asynchronous, and Massively Parallel (RAMP) Encoders are used to process asynchronous events and images. Patches are extracted from the resulting encoding and used by the Estimator, inspired by DPVO, to perform data-driven feature tracking and visual odometry. A simple pose forecasting module exploits previously extracted patches to initialize poses in the bundle adjustment, allowing for improved performance.
  • Figure 2: An overview of the proposed RAMP Net encoder. Events and images are first asynchronously processed by two parallel pixel-wise, multi-scale encoding branches (PWE), each made of a set of convolutional layers followed by pixel-wise LSTMs $G_k^s$. A shared state $\Sigma^s_t$ is then updated (SU) with features coming from different data modalities $k$ by employing sensor-specific encoders at each scale. Finally, the multi-scale features are combined through two separate fusion modules (MSF) to produce the matching and context features, $m_t$ and $c_t$.
  • Figure 3: Illustration of pose initialization. Through patch extraction and projection into future frames we construct feature tracks for frames $j,j-1,...$ which we use to construct the splines $S^l(t_j)$. To perform pose initialization, we extrapolate the feature tracks to time $t_{j+n}$, and apply bundle adjustment to solve for the forecasted pose $T_{j+n}$.
  • Figure 4: Comparisons of the ablated models (a) and of RAMP with asynchronous and synchronized data (b) on the full TartanAir test set. We show the importance of using a RAMP encoder as in RAMP-VO over a sequential, single-scale encoder and feed-forward encoder. We also show the benefits of using the full event information in RAMP-VO with finer discretizations. The RAMP encoder is better at maintaining memory than the RAM-Net-like encoder, as highlighted in low framerate VO experiments (c) on the carwelding sequences of TartanAir.
  • Figure 5: Preview and qualitative trajectory comparison on the Malapert dataset (a) and the Apollo dataset (b). Note that while the Malapert sequence is measured in kilometers, the Apollo sequence, recorded at a miniature scale of the Moon's surface, is in centimeters.
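
The track extrapolation step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a low-degree polynomial fit with NumPy stands in for the splines $S^l$, and the function name `extrapolate_tracks`, its parameters, and the array layout are hypothetical. The subsequent bundle adjustment that solves for the forecasted pose $T_{j+n}$ is omitted.

```python
import numpy as np


def extrapolate_tracks(times, tracks, t_future, degree=2):
    """Extrapolate 2D feature tracks to a future timestamp.

    Assumption: a per-coordinate polynomial replaces the paper's splines.

    times:    (N,) timestamps of the past frames, with N > degree
    tracks:   (N, L, 2) pixel positions of L patches in each past frame
    t_future: timestamp to extrapolate to (e.g. t_{j+n})
    returns:  (L, 2) predicted pixel positions at t_future
    """
    times = np.asarray(times, dtype=float)
    tracks = np.asarray(tracks, dtype=float)
    n_frames, n_patches, _ = tracks.shape

    # Centre time on the latest frame to keep the fit well conditioned.
    t_ref = times[-1]

    # Fit all 2L coordinates at once: polyfit accepts y of shape (N, K).
    flat = tracks.reshape(n_frames, n_patches * 2)
    coeffs = np.polyfit(times - t_ref, flat, deg=degree)  # (degree + 1, 2L)

    # Evaluate each polynomial at t_future (highest power comes first).
    powers = (t_future - t_ref) ** np.arange(degree, -1, -1)
    return (powers @ coeffs).reshape(n_patches, 2)


# Usage: two patches moving at constant velocity over four frames.
times = np.array([0.0, 1.0, 2.0, 3.0])
start = np.array([[10.0, 20.0], [30.0, 40.0]])
velocity = np.array([[1.0, -2.0], [0.5, 0.25]])
tracks = start[None] + times[:, None, None] * velocity[None]
predicted = extrapolate_tracks(times, tracks, t_future=5.0)
```

In a full pipeline the predicted patch positions would serve as the observations from which an initial pose is solved, which is the role the caption assigns to bundle adjustment.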