ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
TL;DR
ViSTA-SLAM tackles intrinsics-free, real-time monocular dense SLAM by combining a lightweight symmetric two-view association (STA) frontend with a $Sim(3)$ pose-graph backend and loop closure. The STA frontend regresses local point maps and a relative pose from two RGB images using a single decoder, which the symmetry of the two-view formulation makes possible, and its training enforces cycle and geometric consistency. The backend constructs a pose graph with pose and scale edges, adds loop closures detected via bag-of-words (BoW), and optimizes the graph with Levenberg–Marquardt on the Lie group, yielding state-of-the-art trajectory accuracy and dense reconstruction on 7-Scenes and TUM-RGBD. It achieves these results with a frontend only 35% the size of comparable state-of-the-art methods, enabling robust, real-time dense SLAM across diverse camera setups and scenes and suggesting strong potential for robotics and AR/VR.
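The cycle-consistency idea mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the relative poses predicted in the two directions (A to B and B to A) are expressed as similarity transforms x -> s R x + t stored as 4x4 matrices; a consistent pair should then compose to the identity. The helper names (`make_sim3`, `sim3_inverse`, `cycle_error`) and the numeric values are hypothetical and are not taken from the ViSTA-SLAM code, and the actual training loss may be defined differently.

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_sim3(scale, rotation, translation):
    """Pack x -> scale * rotation @ x + translation into a 4x4 matrix."""
    T = np.eye(4)
    T[:3, :3] = scale * np.asarray(rotation, dtype=float)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def sim3_inverse(T):
    """Invert a similarity transform stored as [[s*R, t], [0, 1]]."""
    sR = T[:3, :3]
    s = np.cbrt(np.linalg.det(sR))  # det(s*R) = s**3 for a proper rotation
    R = sR / s
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T / s
    T_inv[:3, 3] = -(R.T @ T[:3, 3]) / s
    return T_inv

def cycle_error(T_ab, T_ba):
    """Frobenius distance between the round trip A -> B -> A and the identity."""
    return float(np.linalg.norm(T_ab @ T_ba - np.eye(4)))

# Hypothetical predictions for the two directions of one image pair.
T_ab = make_sim3(1.2, rot_z(0.3), [0.1, -0.2, 0.5])
T_ba_consistent = sim3_inverse(T_ab)                         # exact inverse
T_ba_noisy = make_sim3(0.8, rot_z(-0.25), [0.0, 0.1, -0.4])  # slightly off

err_consistent = cycle_error(T_ab, T_ba_consistent)
err_noisy = cycle_error(T_ab, T_ba_noisy)
print(err_consistent, err_noisy)
```

A penalty of this kind is small when the two directional predictions agree and grows as they diverge, which is the property a cycle-consistency term relies on.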
Abstract
We present ViSTA-SLAM, a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design significantly reduces model complexity (our frontend is only 35% the size of comparable state-of-the-art methods) while enhancing the quality of the two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. GitHub repository: https://github.com/zhangganlin/vista-slam
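To make the role of loop closure concrete, the following toy sketch shows how a single loop-closure edge redistributes accumulated drift in a graph. It is a deliberately simplified stand-in, not the paper's backend: it optimizes only one scalar per node (a log-scale), uses linear least squares rather than Levenberg–Marquardt on Sim(3), and the function name `optimize_log_scales` and all measurement values are hypothetical.

```python
import numpy as np

def optimize_log_scales(num_nodes, edges):
    """Least-squares estimate of one log-scale per node.

    Each edge (i, j, m) is a measurement of x[j] - x[i].
    Node 0 is anchored at 0 to remove the global gauge freedom.
    """
    rows, rhs = [], []
    for i, j, measurement in edges:
        row = np.zeros(num_nodes)
        row[j], row[i] = 1.0, -1.0
        rows.append(row)
        rhs.append(measurement)
    anchor = np.zeros(num_nodes)
    anchor[0] = 1.0
    rows.append(anchor)
    rhs.append(0.0)
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.array(rhs), rcond=None)
    return x

# The true log-scale is 0 at every node, but each sequential edge
# over-estimates the relative log-scale by 0.05 (simulated drift).
odometry = [(0, 1, 0.05), (1, 2, 0.05), (2, 3, 0.05)]

# Loop closure: node 3 revisits node 0, measured relative log-scale 0.
loop_closure = [(3, 0, 0.0)]

x_open = optimize_log_scales(4, odometry)
x_closed = optimize_log_scales(4, odometry + loop_closure)
print("drift at last node without loop closure:", abs(x_open[3]))
print("drift at last node with loop closure:   ", abs(x_closed[3]))
```

Without the loop-closure edge the error simply accumulates along the chain; with it, the inconsistency around the loop is spread over all edges, which reduces the error at the last node.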
