Collision avoidance from monocular vision trained with novel view synthesis

Valentin Tordjman--Levavasseur, Stéphane Caron

TL;DR

This work tackles collision avoidance using monocular RGB inputs and an implicit scene representation learned via novel view synthesis. A two-stage pipeline trains a visual encoder on synthetic depth-like targets and a separate policy that outputs joystick corrections, which are executed by a model-predictive locomotion controller. The approach demonstrates repeatable collision-avoidance behavior in a training environment and to some extent in out-of-distribution settings, though outdoor generalization remains challenging. The method offers a lightweight alternative to explicit scene models, enabling real-time operation on modest hardware. Overall, it highlights the potential and current limits of vision-driven collision avoidance with implicit representations for mobile robots.

Abstract

Collision avoidance can be checked in explicit environment models such as elevation maps or occupancy grids, yet integrating such models with a locomotion policy requires accurate state estimation. In this work, we consider the question of collision avoidance from an implicit environment model. We use monocular RGB images as inputs and train a collision-avoidance policy from photorealistic images generated by 2D Gaussian splatting. We evaluate the resulting pipeline in real-world experiments under velocity commands that put the robot on an intercept course with obstacles. Our results suggest that RGB images can be enough to make collision-avoidance decisions, both in the room where training data was collected and in out-of-distribution environments.

Paper Structure

This paper contains 25 sections, 2 equations, 6 figures.

Figures (6)

  • Figure 1: Effect of the vision-based collision-avoidance policy when the commanded velocity prompts the robot to collide with a wall. Blue: joystick user input, kept stationary at full forward throttle. Green: trajectory actually followed by the robot after compensation by the policy.
  • Figure 2: Monocular obstacle avoidance pipeline from the perception to the joint commands.
  • Figure 3: Comparison between the raw mesh export and the mesh after processing by CoACD and manual cleanup.
  • Figure 4: Most significant corrections applied by the collision-avoidance policy in the pure navigation environment when prompted to go fully forward. The largest corrections occur near the walls and point away from them.
  • Figure 5: Examples of depth reconstruction. Top row: the RGB image given to the encoder. Bottom row: the depth output of the decoder.
  • ...and 1 more figure
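
The compensation described in the Figure 1 caption (a user joystick input held at full forward throttle, adjusted by the policy before execution) can be illustrated with a minimal sketch. The paper summary does not state how the correction is combined with the user input, so everything here is an assumption made for illustration: the additive composition, the normalized axis range [-1, 1], and the names `VelocityCommand`, `clip` and `apply_correction` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VelocityCommand:
    """Hypothetical joystick-style command, each axis normalized to [-1, 1]."""

    forward: float
    lateral: float
    yaw: float


def clip(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Saturate a value to the assumed joystick range."""
    return max(low, min(high, value))


def apply_correction(
    user: VelocityCommand, correction: VelocityCommand
) -> VelocityCommand:
    """Add the policy's correction to the user input, then saturate.

    ASSUMPTION: additive composition; the source only says that the
    policy "outputs joystick corrections".
    """
    return VelocityCommand(
        forward=clip(user.forward + correction.forward),
        lateral=clip(user.lateral + correction.lateral),
        yaw=clip(user.yaw + correction.yaw),
    )


# User holds full forward throttle; a made-up correction slows the robot
# down and turns it away from an obstacle.
user_input = VelocityCommand(forward=1.0, lateral=0.0, yaw=0.0)
policy_correction = VelocityCommand(forward=-0.5, lateral=0.0, yaw=0.25)
corrected = apply_correction(user_input, policy_correction)
```

In this sketch the corrected command, rather than the raw user input, would be what gets passed on to the locomotion controller. Saturating after the sum keeps the result inside the range a joystick could itself produce.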