An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World
Xingwu Ji, Haochen Niu, Dexin Duan, Rendong Ying, Fei Wen, Peilin Liu
TL;DR
The paper tackles the generalization challenge of learning-based monocular depth and visual odometry in open-world settings by introducing an online self-supervised adaptation framework. It couples a pre-trained depth model equipped with lightweight refiners (R-DepthNet) with a pseudo RGB-D SLAM in a closed loop, enhanced by Sparse Depth Densification and Dynamic Consistency Enhancement to generate pseudo-depths and valid masks during online updates. The approach achieves robust depth and pose estimation across KITTI, TUM, and a mobile robot platform, with online adaptation requiring only a small fraction of trainable parameters and demonstrating fast convergence. This work enables practical deployment of learning-based VO systems in diverse environments by leveraging real-time feedback from SLAM to adapt depth estimation on the fly.
Abstract
Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting in unreliable depth and pose estimation. To address this problem, this work aims to develop a visual odometry system that can adapt quickly to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. First, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban and in-house datasets and on a robot platform. Code is publicly available at: https://github.com/jixingwu/SOL-SLAM.
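As a rough illustration of how pseudo-depths and valid masks could enter a self-supervised objective, the sketch below computes a masked L1 error between a predicted depth map and a pseudo-depth map. The abstract does not specify the form of the loss, so the L1 choice, the function name `masked_pseudo_depth_loss`, and the toy arrays are assumptions made for illustration only; they are not taken from the authors' code.

```python
import numpy as np


def masked_pseudo_depth_loss(pred_depth, pseudo_depth, valid_mask):
    """Mean absolute depth error over pixels marked valid.

    Hypothetical helper: the L1 form is an assumption, not the paper's
    stated objective.
    """
    mask = valid_mask.astype(bool)
    if not mask.any():
        # No valid supervision in this frame: contribute nothing.
        return 0.0
    return float(np.abs(pred_depth[mask] - pseudo_depth[mask]).mean())


# Toy 2x2 example (made-up values). The masked-out pixel stands in for
# a region rejected as unreliable, e.g. a moving object.
pred = np.array([[2.0, 4.0], [6.0, 8.0]])
pseudo = np.array([[2.5, 4.0], [0.0, 7.0]])
mask = np.array([[1, 1], [0, 1]])

loss = masked_pseudo_depth_loss(pred, pseudo, mask)
```

In an online setting, a loss of this kind would presumably be backpropagated only through the lightweight refiner modules while the pre-trained depth backbone stays frozen, which is consistent with the small trainable-parameter fraction described above.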
