Collision avoidance from monocular vision trained with novel view synthesis
Valentin Tordjman--Levavasseur, Stéphane Caron
TL;DR
This work tackles collision avoidance using monocular RGB inputs and an implicit scene representation learned via novel view synthesis. A two-stage pipeline trains a visual encoder on synthetic depth-like targets and a separate policy that outputs joystick corrections, which are executed by a model-predictive locomotion controller. The approach demonstrates repeatable collision-avoidance behavior in the training environment and, to some extent, in out-of-distribution settings, though outdoor generalization remains challenging. The method offers a lightweight alternative to explicit scene models, enabling real-time operation on modest hardware. Overall, it highlights both the potential and the current limits of vision-driven collision avoidance with implicit representations for mobile robots.
Abstract
Collision avoidance can be checked in explicit environment models such as elevation maps or occupancy grids, yet integrating such models with a locomotion policy requires accurate state estimation. In this work, we consider the question of collision avoidance from an implicit environment model. We use monocular RGB images as inputs and train a collision-avoidance policy from photorealistic images generated by 2D Gaussian splatting. We evaluate the resulting pipeline in real-world experiments under velocity commands that put the robot on an intercept course with obstacles. Our results suggest that RGB images can be enough to make collision-avoidance decisions, both in the room where training data was collected and in out-of-distribution environments.
