Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization

Lahav Lipson, Jia Deng

TL;DR

The paper tackles monocular Multi-Session SLAM across disjoint video sequences. It introduces a differentiable backbone that jointly predicts bi-directional optical flow and camera pose updates, coupled with a differentiable solver that minimizes the symmetric epipolar distance (SED). A dense pre-conditioning step and an iterative update operator enable accurate wide-baseline two-view pose estimation and visual odometry, while Sim(3) trajectory alignment and NetVLAD-based retrieval bring disjoint sequences into a single global frame. The approach is evaluated on real-world datasets (EuRoC-MAV, ETH3D, ScanNet, MegaDepth), where it shows higher accuracy and greater robustness to catastrophic failures than state-of-the-art monocular multi-session methods, and two-view pose results competitive with transformer-based matchers. The work demonstrates that end-to-end differentiable backbones with recurrent pose and motion solvers can integrate cross-session matching, pose refinement, and global optimization in monocular SLAM, enabling robust multi-session mapping in challenging scenarios.
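
For readers unfamiliar with Sim(3) alignment, the standard definition of a similarity transform is sketched here. The symbols $s$, $R$, $t$, $R_i$, and $c_i$ are assumptions chosen for illustration, and the paper's own notation may differ. A monocular reconstruction recovers scale only up to an unknown factor, so two sessions generally differ by a rotation, a translation, and a scale:

$$
S = \begin{pmatrix} sR & t \\ 0^\top & 1 \end{pmatrix}, \qquad x \mapsto sRx + t, \qquad s > 0,\; R \in SO(3),\; t \in \mathbb{R}^3 .
$$

Under this convention, a camera with orientation $R_i$ and center $c_i$ in one session's frame would map to orientation $R R_i$ and center $sRc_i + t$ in the shared global frame.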

Abstract

We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and run global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose

Paper Structure

This paper contains 18 sections, 28 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: Our method estimates camera pose from multiple disconnected video streams.
  • Figure 2: Overview of our backbone. Given a group of $\geq 2$ frames, our method jointly estimates bi-directional optical flow and camera poses. This module is applicable to (1) wide-baseline two-view matching and (2) visual odometry. We use our backbone for Multi-Session SLAM, which requires the ability to perform both (1) and (2). For each image, we select a set of 2D anchor points and initialize their depth and the camera poses trivially. Our approach then iteratively refines the matches for each anchor, similar to RAFT (Teed and Deng, 2020), while updating the camera poses. We alternate between matching and pose updates, where each one informs the update to the other. This entire procedure is repeated several times until convergence.
  • Figure 3: A single update iteration (two-view). For each anchor-match pair, we predict an update to the matches using the RNN. We then solve for an update to the camera poses which minimizes the symmetric epipolar distance (SED), producing a new set of epipolar lines. Finally, we clamp the matches back to the best-fit epipolar lines, and repeat the whole process again. In Fig. \ref{fig:qual_iters} and the Appendix, we visualize these iterations on real-world images.
  • Figure 4: The RNN Module. For each anchor-match pair $(a_k, m_{kj})$, it predicts an update to $m_{kj}$ and an associated confidence $w_{kj}$. It also updates a hidden state $h_{kj} \in \mathbb{R}^{384}$. Internally, the RNN shares features via attention between updates with the same incoming and outgoing frame.
  • Figure 5: The flow of gradients through our solver. The updated matches from the RNN are supervised directly, and then detached from the gradient tape. The solver output is supervised with a pose loss. In the backward pass, gradients from the pose loss are used to supervise the confidence head in order to learn outlier rejection.
  • ...and 8 more figures
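
The Figure 3 caption mentions two geometric operations: measuring the symmetric epipolar distance (SED) of an anchor-match pair, and moving a match back onto its epipolar line. The sketch below illustrates both under stated assumptions. It is not the authors' differentiable solver: it only evaluates the residual for a given relative pose and does not optimize anything. It assumes normalized (calibrated) image coordinates, the convention $x_2 \propto R x_1 + t$ with essential matrix $E = [t]_\times R$, and that "clamping" means orthogonal projection onto the line. All function names are hypothetical.

```python
# Minimal pure-Python sketch (assumed conventions, hypothetical names).

def skew(t):
    """Cross-product matrix [t]_x of a 3-vector."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def essential(R, t):
    """Essential matrix E = [t]_x R for the convention x2 ~ R x1 + t."""
    return matmul(skew(t), R)

def sed(E, x1, x2):
    """Symmetric epipolar distance (squared form) of a correspondence.

    Sums the squared point-to-line distances in both images:
    x2 to the line E x1, and x1 to the line E^T x2.
    """
    h1 = [x1[0], x1[1], 1.0]
    h2 = [x2[0], x2[1], 1.0]
    l2 = matvec(E, h1)             # epipolar line of x1 in image 2
    l1 = matvec(transpose(E), h2)  # epipolar line of x2 in image 1
    r = sum(a * b for a, b in zip(h2, l2))  # algebraic residual x2^T E x1
    return r * r * (1.0 / (l2[0] ** 2 + l2[1] ** 2)
                    + 1.0 / (l1[0] ** 2 + l1[1] ** 2))

def project_to_epipolar_line(E, x1, x2):
    """Orthogonally project the match x2 onto the epipolar line of x1."""
    a, b, c = matvec(E, [x1[0], x1[1], 1.0])
    d = (a * x2[0] + b * x2[1] + c) / (a * a + b * b)
    return (x2[0] - a * d, x2[1] - b * d)

# Usage: identity rotation with translation along x gives horizontal
# epipolar lines, so the projection only changes the y-coordinate.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E = essential(identity, (1.0, 0.0, 0.0))
anchor, match = (0.2, 0.1), (0.5, 0.3)
print(sed(E, anchor, match))                  # about 0.08
print(project_to_epipolar_line(E, anchor, match))
```

In the system described above, the pose that defines these lines is itself updated at each iteration by minimizing a confidence-weighted version of this residual over all anchor-match pairs. That step is omitted here.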