LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

Weirong Chen, Le Chen, Rui Wang, Marc Pollefeys

TL;DR

LEAP-VO addresses robust visual odometry in dynamic environments by moving beyond two-view tracking to long-term point tracking that leverages temporal context. It introduces the LEAP module, which fuses visual cues, inter-track information via anchors, and a temporal probabilistic model to estimate trajectory distributions and per-point uncertainty, expressed as $p(\mathbf{X}|\mathbf{I},\mathbf{x}_q)$ under a multivariate Cauchy formulation. LEAP serves as the front-end of LEAP-VO, which tracks points over a window, filters tracks by visibility, dynamism, and uncertainty, and optimizes poses with sliding-window bundle adjustment (BA) using the LEAP mappings $\text{LEAP}_{i\rightarrow j}$. Experiments on Replica, MPI Sintel, and TartanAir demonstrate substantial improvements over state-of-the-art baselines, particularly in dynamic scenes and under occlusion, highlighting the practical impact of long-term, uncertainty-aware tracking for VO.

Abstract

Visual odometry estimates the motion of a moving camera based on visual input. Existing methods, mostly focusing on two-view point tracking, often ignore the rich temporal context in the image sequence, thereby overlooking the global motion patterns and providing no assessment of the full trajectory reliability. These shortcomings hinder performance in scenarios with occlusion, dynamic objects, and low-texture areas. To address these challenges, we present the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively combines visual, inter-track, and temporal cues with mindfully selected anchors for dynamic track estimation. Moreover, LEAP's temporal probabilistic formulation integrates distribution updates into a learnable iterative refinement module to reason about point-wise uncertainty. Based on these traits, we develop LEAP-VO, a robust visual odometry system adept at handling occlusions and dynamic scenes. Our mindful integration showcases a novel practice by employing long-term point tracking as the front-end. Extensive experiments demonstrate that the proposed pipeline significantly outperforms existing baselines across various visual odometry benchmarks.

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison between two-view approach and multi-view approach for visual odometry. (a) In the two-view approach, correspondences are derived for every image pair and concatenated. Managing occlusion becomes challenging without the temporal context. (b) In the multi-view approach, long-term point trajectories can be obtained all at once, enabling the detection of occlusion and tracking points under partial occlusions.
  • Figure 2: Distributed image gradient-based sampling with $k=8, N_a=64$. After computing the image gradient and pooling, we split the gradient map into a $k \times k$ grid ($8 \times 8$ cells) and select the point with the maximum gradient in each cell, giving $N_a = k^2 = 64$ points.
  • Figure 3: LEAP Front-end. Once image feature maps are obtained, selected anchors aid in tracking. The queries and anchors are processed by a refiner to iteratively update states. The model outputs trajectory distribution, visibility, and dynamic track label.
  • Figure 4: LEAP-VO pipeline. Given a new image $\mathbf{I}_t$ received at time $t$, the keypoint extractor extracts new keypoints from $\mathbf{I}_t$. Then, all the keypoints from the latest $S_{LP}$ frames $\mathbf{I}_{t-S_{LP}+1:t}$ are tracked across all other frames within the current LEAP window, followed by a track filtering step to remove outliers. Finally, the local BA module is applied to the current BA window to update the camera poses and 3D positions of the extracted keypoints. The colored arrows denote the direction in which each module moves when a new image is received.
  • Figure 5: Qualitative results of camera trajectory estimation on TartanAir-Shibuya (Qiu et al., 2022). The visualizations show that our method provides more robust and accurate camera pose trajectories, especially in hard cases (RoadCrossing 06 and RoadCrossing 07).
  • ...and 2 more figures
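
The grid-based selection described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_anchors` is hypothetical, the gradient is computed with simple finite differences, and the pooling step mentioned in the caption is omitted because its kernel size is not given here.

```python
import numpy as np


def sample_anchors(image: np.ndarray, k: int = 8) -> np.ndarray:
    """Pick one point per cell of a k x k grid, at the cell's maximum gradient.

    Hypothetical helper for illustration. Returns an integer array of shape
    (k * k, 2) holding (x, y) pixel coordinates.
    """
    img = image.astype(np.float64)
    if img.ndim == 3:
        # Assumption: collapse colour channels by averaging.
        img = img.mean(axis=2)

    # Finite-difference gradients along rows (y) and columns (x).
    gy, gx = np.gradient(img)
    grad = np.hypot(gx, gy)  # gradient magnitude

    h, w = grad.shape
    # Cell boundaries; cells may differ by one pixel when h or w
    # is not divisible by k.
    ys = np.linspace(0, h, k + 1).astype(int)
    xs = np.linspace(0, w, k + 1).astype(int)

    anchors = []
    for i in range(k):
        for j in range(k):
            cell = grad[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            # Location of the strongest gradient inside this cell.
            r, c = np.unravel_index(np.argmax(cell), cell.shape)
            anchors.append((xs[j] + c, ys[i] + r))
    return np.array(anchors)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.random((64, 96))
    pts = sample_anchors(demo, k=8)
    print(pts.shape)  # one (x, y) pair per grid cell
```

Taking one point per cell keeps the selected points spread over the whole image instead of clustering in the most textured region, which is the "distributed" aspect named in the caption.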