CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Mohamed Elnoor, Anuj Zore, Brian Ichter, Fei Xia, Jie Tan, Wenhao Yu, Dinesh Manocha

TL;DR

CoNVOI introduces a context-aware navigation framework that leverages Vision Language Models (VLMs) to produce context-consistent reference trajectories for indoor and outdoor robot navigation. The core innovations are a context-based prompting mechanism and a multi-modal visual marking scheme that ground VLM attention to obstacle-free regions and their map-relative locations. By integrating the VLM-derived reference path with a traditional motion planner and employing path extrapolation to limit query frequency, CoNVOI achieves human-like navigation behaviors without domain-specific training. Experimental results on real robots demonstrate strong alignment with human teleoperation and significant reductions in unsafe paths and unnecessary VLM queries, highlighting practical potential and current limitations related to latency and remote execution of large models.

Abstract

We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways. First, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g., "stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach that annotates the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating the image with a local occupancy map of the environment. The marked numbers ground image locations in the real world, direct the VLM's attention solely to navigable locations, and convey to the VLM the spatial relationships between those locations and the terrains depicted in the image. Next, we query the VLM to select numbers on the marked image that satisfy the context-based behavior text prompt, and construct a reference path using the selected numbers. Finally, we propose a method to extrapolate the reference trajectory when the robot's environmental context has not changed, to prevent unnecessary VLM queries. We use the reference trajectory to guide a motion planner, and demonstrate that it leads to human-like behaviors (e.g., not cutting through a group of people, using crosswalks, etc.) in various real-world indoor and outdoor scenarios.
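
As an illustration of the marking and path-construction steps described in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the grid layout, the `stride` sampling, and the function names `number_free_cells` and `build_reference_path` are assumptions made for this example. It also omits the projection of map locations into the RGB image and stubs the VLM's answer as a plain list of numbers.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in a local map frame, in metres (assumed)


def number_free_cells(
    occupancy: List[List[int]], resolution: float, stride: int = 2
) -> Dict[int, Point]:
    """Assign increasing numbers to sampled free cells (value 0) of a grid.

    Hypothetical helper: cells are visited in row-major order and every
    `stride`-th row and column is sampled to keep the markers sparse.
    """
    markers: Dict[int, Point] = {}
    next_id = 1
    for row in range(0, len(occupancy), stride):
        for col in range(0, len(occupancy[row]), stride):
            if occupancy[row][col] == 0:
                markers[next_id] = (row * resolution, col * resolution)
                next_id += 1
    return markers


def build_reference_path(
    markers: Dict[int, Point], selected: List[int]
) -> List[Point]:
    """Turn the numbers selected by the VLM into map-frame waypoints.

    Numbers that do not correspond to a marker are skipped.
    """
    return [markers[n] for n in selected if n in markers]


# Toy 3x4 occupancy grid: 0 = free, 1 = occupied.
grid = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
markers = number_free_cells(grid, resolution=0.5, stride=1)
# Stand-in for the VLM's reply to the marked image and text prompt.
path = build_reference_path(markers, selected=[1, 4, 6, 42])
```

Because each number maps to a known map location, the VLM only has to pick numbers; it does not need to output metric coordinates.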

Paper Structure (22 sections, 6 equations, 4 figures, 2 tables)

This paper contains 22 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: [Top]: The trajectories of a Spot robot crossing the road when using CoNVOI with GPT-4v [gpt-4v] (in green), CoNVOI with Gemini [gemini] (in purple), teleoperated by a human (in red), GA-Nav [guan2021ganav], and DWA [DWA]. CoNVOI navigates the robot on the crosswalk by understanding the environmental context. [Bottom]: The trajectories when the Spot robot navigates to a goal beyond a blockade. With CoNVOI, handling for such scenarios can be added on the fly using a simple text prompt, without any reformulation. The RGB images from the robot are shown with CoNVOI's multi-modal visual marking (numbers in yellow). GPT-4v/Gemini is queried with the marked image and a context-based text prompt (shown in quotes), and it returns the green reference path, which follows explicit and implicit social rules and human-like preferences during indoor and outdoor navigation.
  • Figure 2: CoNVOI's architecture utilizes CLIP to interpret the context of the robot's environment from an RGB image ($I^{RGB}_t$), identifying features such as indoor corridors, social scenarios with people, outdoor terrains, etc. Next, CoNVOI queries a large VLM with a context-based text prompt and the RGB image marked with numbers ($I^{Mark}_t$) in the free space detected in an occupancy grid map, to generate a reference path (in green) that adheres to explicit and implicit social rules (e.g., staying on pavement, using crosswalks). A dedicated motion planner then follows this path while avoiding obstacles. To avoid unnecessarily requerying the large VLM, we extrapolate this reference path linearly and check whether the extrapolated point (in yellow) lies on a paved path (sidewalk, corridor, etc.); if it does, we use it for navigation, and otherwise we requery the VLM. Instead of using VLMs for direct robot control or for well-addressed tasks such as goal-reaching and obstacle avoidance, we leverage their context-understanding capabilities to achieve more intricate, zero-shot navigation behaviors using a separate planner.
  • Figure 3: Robot trajectories when navigating in different complex indoor and outdoor environments using various methods: CoNVOI (in green), teleoperated by a human (in red), DWA [DWA] (in blue), Frozone [frozone] (in violet), GA-Nav [guan2021ganav] (in orange). CoNVOI exhibits socially compliant behaviors, such as not moving in between humans even if there is sufficient space, similar to Frozone, a method formulated for indoor social navigation. CoNVOI's behaviors also match GA-Nav, a semantic segmentation-based navigation approach that prefers to navigate on smooth, well-paved outdoor terrains. CoNVOI achieves these behaviors in a zero-shot manner and does not require domain-specific fine-tuning. CoNVOI's trajectories also closely match the human-teleoperated ground truth paths.
  • Figure 4: Qualitative comparison of the reference trajectories generated by GPT-4v [gpt-4v] (in green) and Gemini [gemini] (in orange) with the human-provided ground truth (in blue). Each trajectory is formed by connecting the marked numbers (in yellow). We observe that the reference paths generated by the VLMs are comparable to the human-preferred path in many complex indoor and outdoor environments.
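
The linear extrapolation check described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the names `extrapolate` and `next_reference_point`, the extrapolation `distance`, and the `is_on_paved_path` predicate (here a stand-in for whatever test decides that a point lies on a sidewalk or corridor) are all hypothetical.

```python
import math
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in a local map frame, in metres (assumed)


def extrapolate(path: List[Point], distance: float) -> Optional[Point]:
    """Extend the last segment of `path` by `distance` along its direction.

    Returns None when the path has fewer than two points or the last
    segment has zero length, since no direction can be inferred.
    """
    if len(path) < 2:
        return None
    (x0, y0), (x1, y1) = path[-2], path[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return None
    return (x1 + distance * dx / norm, y1 + distance * dy / norm)


def next_reference_point(
    path: List[Point],
    distance: float,
    is_on_paved_path: Callable[[Point], bool],
) -> Optional[Point]:
    """Return the extrapolated point if it is acceptable, else None.

    A None result signals that the caller should requery the VLM.
    """
    candidate = extrapolate(path, distance)
    if candidate is not None and is_on_paved_path(candidate):
        return candidate
    return None


# Hypothetical paved-path test: a corridor 1 m wide along the x-axis.
def in_corridor(p: Point) -> bool:
    return abs(p[1]) <= 0.5


straight = next_reference_point([(0.0, 0.0), (1.0, 0.0)], 2.0, in_corridor)
diagonal = next_reference_point([(0.0, 0.0), (3.0, 4.0)], 5.0, in_corridor)
```

In this toy setup the straight path stays inside the corridor, so its extrapolated point can be reused without a new query, while the diagonal path leaves the corridor and would trigger a requery.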