Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments

Mohamed Elnoor; Kasun Weerakoon; Gershom Seneviratne; Ruiqi Xian; Tianrui Guan; Mohamed Khalid M Jaffar; Vignesh Rajagopal; Dinesh Manocha

TL;DR

This work uses in-context learning to ground a vision-language model's (VLM's) semantic understanding in proprioceptive data, so that traversability estimates are updated dynamically from the robot's real-time physical interactions with the environment. The updated estimates inform both the local and global planners for real-time trajectory replanning.

Abstract

We present a novel autonomous robot navigation algorithm for outdoor environments that is capable of handling diverse terrain traversability conditions. Our approach, VLM-GroNav, integrates vision-language models (VLMs) with physical grounding to assess intrinsic terrain properties such as deformability and slipperiness. We use proprioception-based sensing, which provides direct measurements of these physical properties and enhances the overall semantic understanding of the terrains. Our formulation uses in-context learning to ground the VLM's semantic understanding in proprioceptive data, allowing traversability estimates to be updated dynamically based on the robot's real-time physical interactions with the environment. We use the updated traversability estimates to inform both the local and global planners for real-time trajectory replanning. We validate our method on a legged robot (Ghost Vision 60) and a wheeled robot (Clearpath Husky) in diverse real-world outdoor environments with different deformable and slippery terrains. In practice, we observe an increase of up to 50% in navigation success rate over state-of-the-art methods.
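
The loop described in the abstract (measure the terrain through proprioception, update the traversability estimates, replan) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Observation` schema, the prompt wording, the terrain scores, and the path lengths are hypothetical. In particular, `update_traversability` is a simple weighted blend that stands in for the VLM's in-context answer; it is not the paper's formulation, and no real VLM is called.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One proprioceptive measurement tied to a terrain label (hypothetical schema)."""
    terrain: str
    slip: float  # 0.0 = firm footing, 1.0 = fully slipping or sinking


def build_prompt(observations, candidate_terrains):
    """Format past measurements as in-context examples for a VLM query."""
    lines = ["Measured terrain interactions so far:"]
    for obs in observations:
        lines.append(f"- {obs.terrain}: slip={obs.slip:.2f}")
    lines.append("Rate the traversability (0 to 1) of: " + ", ".join(candidate_terrains))
    return "\n".join(lines)


def update_traversability(prior, observations, weight=0.5):
    """Stand-in for the VLM's answer: blend each prior score with (1 - slip)."""
    updated = dict(prior)  # copy so the prior estimates stay untouched
    for obs in observations:
        measured = 1.0 - obs.slip
        updated[obs.terrain] = (1.0 - weight) * updated[obs.terrain] + weight * measured
    return updated


def path_cost(segments, traversability):
    """Sum of segment length divided by terrain traversability (lower is better)."""
    return sum(length / max(traversability[terrain], 1e-6) for terrain, length in segments)


def pick_path(paths, traversability):
    """Return the name of the candidate path with the lowest cost."""
    return min(paths, key=lambda name: path_cost(paths[name], traversability))


# Hypothetical initial estimates, e.g. from a VLM looking at an aerial image.
prior = {"sand": 0.8, "concrete": 0.95}
# Each path is a list of (terrain, length in meters) segments.
paths = {"direct": [("sand", 10.0)], "detour": [("concrete", 16.0)]}

before = pick_path(paths, prior)  # chosen before any physical interaction

# The robot slips heavily on sand, so the estimate for sand drops.
observations = [Observation("sand", 0.9)]
updated = update_traversability(prior, observations)
after = pick_path(paths, updated)  # chosen after the update

prompt = build_prompt(observations, sorted(prior))
```

With these assumed numbers, the direct sand path is cheaper under the initial estimates, and the longer concrete detour becomes cheaper once the high-slip measurement lowers the score for sand. In a real system, the text from `build_prompt` would be sent to the VLM together with the images, and the VLM's reply would replace the blend.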

Paper Structure (22 sections, 9 equations, 3 figures, 1 table)

This paper contains 22 sections, 9 equations, 3 figures, 1 table.

Introduction
Related Work
Traversability Estimation
Local & Global Path Planning
Foundation and Large Models in Navigation
Background
Setup, Assumptions, and Conventions
Motion Planner
Our Approach
Traversability Estimation using Proprioceptive sensing
Deformability for legged robots
Slippage for wheeled robots
Physically Grounded Reasoning Module
High-Level Global Planner
Adaptive Local Planner
...and 7 more sections

Figures (3)

Figure 1: Overview of our VLM-GroNav system: Our method uses the given information to achieve a navigation objective. We leverage VLMs and aerial imagery to estimate initial terrain traversability. The robot's local exteroceptive and proprioceptive sensors guide the VLM to update the traversability and replan the robot's trajectory.
Figure 2: The VLM-GroNav system employs a reasoning module that integrates visual inputs from aerial imagery, weather conditions, and proprioceptive data through a large VLM to refine terrain traversability estimates and inform navigation decisions. The global planner uses these refined estimates to generate optimal waypoints visually marked on the aerial image to guide the robot toward the goal while achieving the goal objective. The local planner utilizes a frontier-based approach, incorporating real-time proprioceptive feedback to adapt the trajectory dynamically for both legged and wheeled robots for robust navigation.
Figure 3: Comparison of navigation trajectories across various environments using different methods: DWA (black), GA-Nav (orange), CoNVOI (dark purple), ViNT (light purple), and our method VLM-GroNav (sky blue). The yellow star represents the start location and the red star represents the goal location. (a) shows Scenario 1, (b) shows Scenario 2, (c) shows Scenario 3, and (d) shows Scenario 4. The top row shows the top-down image of the scene. Top images also contain circles with terrain pictures (1: Grass, 2: Sand, 3: Concrete, 4: Mulch, 5: Muddy Grass, 6: Snow, 7: Muddy Grass). Our method demonstrates a more direct path with minimal detours.
