ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association

Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers

TL;DR

ViSTA-SLAM tackles intrinsics-free, real-time monocular dense SLAM by combining a lightweight symmetric two-view association (STA) frontend with a $\mathrm{Sim(3)}$ pose-graph backend and loop closure. Because the STA frontend is symmetric, it needs only a single decoder to regress local point maps and a relative pose from two RGB images; training enforces cycle and geometric consistency. The backend builds a pose graph with pose and scale edges, detects loop closures with bag-of-words (BoW) retrieval, and optimizes the graph with Levenberg–Marquardt on the Lie group, yielding state-of-the-art trajectory accuracy and dense reconstruction on 7-Scenes and TUM-RGBD. The approach achieves this with a significantly smaller frontend, enabling robust, real-time dense SLAM across diverse camera setups and scenes, which suggests broad applicability in robotics and AR/VR.

Abstract

We present ViSTA-SLAM, a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design significantly reduces model complexity (our frontend is only 35% the size of comparable state-of-the-art methods) while enhancing the quality of the two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. GitHub repository: https://github.com/zhangganlin/vista-slam
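
To make the role of a Sim(3) pose graph with loop closure more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each relative measurement is a similarity transform $x \mapsto sRx + t$ and only evaluates the residual of one loop-closure edge; it does not run any optimization. The names `Sim3`, `rot_z`, `edge_residual`, and `accumulate` are hypothetical helpers introduced purely for illustration.

```python
import numpy as np


class Sim3:
    """Similarity transform x -> s * R @ x + t (hypothetical helper class)."""

    def __init__(self, s, R, t):
        self.s = float(s)
        self.R = np.asarray(R, dtype=float)
        self.t = np.asarray(t, dtype=float)

    def compose(self, other):
        # self(other(x)) = s1*s2 * R1 R2 x + s1 * R1 t2 + t1
        return Sim3(self.s * other.s,
                    self.R @ other.R,
                    self.s * (self.R @ other.t) + self.t)

    def inverse(self):
        # x = (1/s) * R^T (y - t)
        R_inv = self.R.T
        return Sim3(1.0 / self.s, R_inv, -(R_inv @ self.t) / self.s)


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def edge_residual(T_i, T_j, T_ij):
    """Error of one edge as [log-scale, rotation angle, translation norm].

    All three entries vanish when T_i^{-1} * T_j equals the measurement T_ij.
    """
    err = T_ij.inverse().compose(T_i.inverse().compose(T_j))
    cos_angle = np.clip((np.trace(err.R) - 1.0) / 2.0, -1.0, 1.0)
    return np.array([np.log(err.s), np.arccos(cos_angle), np.linalg.norm(err.t)])


def accumulate(rel, n):
    """Chain n copies of a relative transform, starting from the identity."""
    poses = [Sim3(1.0, np.eye(3), np.zeros(3))]
    for _ in range(n):
        poses.append(poses[-1].compose(rel))
    return poses


identity = Sim3(1.0, np.eye(3), np.zeros(3))

# Four 90-degree turns with unit forward motion trace a closed square.
step = Sim3(1.0, rot_z(np.pi / 2), [1.0, 0.0, 0.0])
# Same motion, but every two-view estimate over-scales by 5% (assumed drift).
step_drift = Sim3(1.05, rot_z(np.pi / 2), [1.0, 0.0, 0.0])

# The loop-closure measurement says view 4 coincides with view 0.
poses_clean = accumulate(step, 4)
poses_drift = accumulate(step_drift, 4)
clean = edge_residual(poses_clean[0], poses_clean[-1], identity)
drifted = edge_residual(poses_drift[0], poses_drift[-1], identity)

print("clean loop residual:  ", clean)
print("drifted loop residual:", drifted)
```

In this toy setup the drifted trajectory leaves a nonzero log-scale and translation error on the loop edge. A pose-graph optimizer such as Levenberg–Marquardt would distribute that kind of error across all edges of the graph rather than leaving it concentrated at the loop.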

Paper Structure

This paper contains 23 sections, 17 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: ViSTA-SLAM Results on a Multi-room Scene [dai2017scannet]. By combining the proposed lightweight Symmetric Two-view Association (STA) frontend with $\mathrm{Sim(3)}$ pose graph optimization and loop closure as the backend, ViSTA-SLAM achieves high-quality reconstruction and accurate trajectory estimation on challenging scenes while running in real time.
  • Figure 2: ViSTA-SLAM Overview. Given sequential video frames without intrinsics as input, our frontend model takes in view pairs and predicts local pointmaps and relative poses within each pair. We then use the pair-wise predictions to construct a $\mathrm{Sim(3)}$ pose graph with loop closure and optimize it via the Levenberg–Marquardt algorithm. The frontend model employs a fully symmetric design, which makes the model lightweight and supports more flexible pose graph optimization. The blue edges in the pose graph and final results represent connections between neighboring nodes (views), while the orange edges correspond to loop closures.
  • Figure 3: Asymmetric vs. Symmetric Architectures. Asymmetric architectures [dust3r, mast3r] use two decoders to regress point maps in a shared coordinate space. Our symmetric formulation regresses relative pose and local point maps with only a single decoder, cutting the parameter count by over 36% ($\sim$0.4 vs. 0.7 billion) while achieving higher accuracy and enabling pose graph optimization in the backend.
  • Figure 5: Trajectory estimation results on 7-Scenes office (top) and TUM-RGBD room (bottom). Estimated camera trajectories are projected onto the $x$–$y$ plane, with ground truth shown as dashed lines. The trajectory color encodes ATE RMSE: higher errors in red, lower in blue. For MASt3R-SLAM [murai2025mast3rslam] and VGGT-SLAM [maggio2025vggtslam], only the poses of their selected keyframes are estimated.
  • Figure 6: Reconstruction results on 7-Scenes redkitchen (left), TUM-RGBD room (middle), and BundleFusion apt1 (right). Purple boxes highlight reconstruction artifacts near the edges (background points wrongly mapped to the edge of the foreground). Red boxes indicate misalignments. Green boxes highlight ViSTA-SLAM's competitive results. VGGT-SLAM fails to complete the reconstruction on apt1 because its pose graph optimization diverges.
  • ...and 3 more figures