HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM
Ziren Gong, Fabio Tosi, Youmin Zhang, Stefano Mattoccia, Matteo Poggi
TL;DR
HS-SLAM addresses key challenges in dense SLAM with a hybrid scene representation that fuses hash grids, tri-planes, and one-blob encodings, enabling complete and textured scene reconstructions. It introduces non-local patch-based structural supervision and an active global bundle adjustment that maintains global consistency in scenes with significant movement or regions at risk of being forgotten. Across Replica, ScanNet, and TUM RGB-D, HS-SLAM outperforms state-of-the-art NeRF-centric baselines in both tracking accuracy and reconstruction quality while remaining computationally efficient for robotics. The approach offers a practical balance between accuracy and speed, with potential extensions to submaps and loop closure for larger-scale environments.
Abstract
NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods struggle to provide sufficient scene representation capacity, to capture structural information, and to maintain global consistency in scenes that involve significant movement or are at risk of being forgotten. To this end, we present HS-SLAM. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-plane, and one-blob encodings, improving the completeness and smoothness of the reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays, which better captures the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) that eliminates camera drift and mitigates accumulated errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
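To make the idea of sampling patches of non-local pixels more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "non-local" means the pixels of a patch are separated by a fixed stride, so that one small patch covers a wide image region. The function name `sample_nonlocal_patches` and the parameters `patch_size` and `stride` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_nonlocal_patches(height, width, num_patches, patch_size=4, stride=8, rng=None):
    """Sample strided pixel patches (hypothetical helper, not from the paper).

    Returns an integer array of shape (num_patches, patch_size, patch_size, 2)
    holding (row, col) coordinates. Neighbouring pixels inside a patch are
    `stride` pixels apart, which is our assumed reading of "non-local".
    """
    rng = np.random.default_rng() if rng is None else rng

    # Distance between the first and last pixel of a patch along one axis.
    span = (patch_size - 1) * stride
    if span >= height or span >= width:
        raise ValueError("patch footprint exceeds image size")

    # Random top-left corner per patch, chosen so the footprint stays in bounds.
    top = rng.integers(0, height - span, size=num_patches)
    left = rng.integers(0, width - span, size=num_patches)

    # Strided offsets shared by all patches.
    offsets = np.arange(patch_size) * stride

    # Broadcast corners and offsets into per-patch coordinate grids.
    shape = (num_patches, patch_size, patch_size)
    rows = np.broadcast_to(top[:, None, None] + offsets[None, :, None], shape)
    cols = np.broadcast_to(left[:, None, None] + offsets[None, None, :], shape)
    return np.stack([rows, cols], axis=-1)


if __name__ == "__main__":
    coords = sample_nonlocal_patches(480, 640, num_patches=16,
                                     rng=np.random.default_rng(0))
    print(coords.shape)
```

Under these assumptions, each patch yields a small grid of rays whose rendered values can be compared with the corresponding ground-truth pixels as a group, which is what lets a patch-level loss see structure that per-ray losses cannot.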
