BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes

Kasun Weerakoon, Mohamed Elnoor, Gershom Seneviratne, Vignesh Rajagopal, Senthil Hariharan Arul, Jing Liang, Mohamed Khalid M Jaffar, Dinesh Manocha

TL;DR

BehAV tackles outdoor robot navigation under user-specified behavioral constraints by combining language-driven instruction decoding with vision-language grounding. It grounds behavioral rules into a real-time behavioral cost map and integrates this map with a LiDAR occupancy map within an unconstrained MPC planner, so the robot can follow landmarks and adhere to the instructed behaviors at the same time. Key contributions include a novel behavioral cost map representation, a visual landmark estimation pipeline using large VLMs, and a behavior-aware planner with gait-switching for stability. Experimental results on a quadruped platform show substantial improvements in alignment with human teleoperation (Fréchet distance) and in navigation success over state-of-the-art baselines, signaling practical impact for safe, instruction-guided outdoor autonomy.

Abstract

We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., "move forward until") and associated landmarks (e.g., "the building with blue windows"), while behavioral guidelines encompass regulatory actions (e.g., "stay on") and their corresponding objects (e.g., "pavements"). We use VLMs for their zero-shot scene understanding capabilities to estimate landmark locations from RGB images for robot navigation. Further, we introduce a novel scene representation that utilizes VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is integrated with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)-based planner that prioritizes both reaching landmarks and following behavioral guidelines. We evaluate the performance of BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49% improvement in alignment with human-teleoperated actions, as measured by Fréchet distance, and achieving a 40% higher navigation success rate compared to state-of-the-art methods.
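
The alignment metric named above, the Fréchet distance, compares two paths while respecting the order of their waypoints. The abstract does not say which variant is used, so the following is only a minimal sketch of the standard discrete Fréchet distance (dynamic-programming form), assuming both trajectories are given as lists of 2D waypoints. The function name `discrete_frechet` and the sample paths are illustrative, not taken from the paper.

```python
import math


def discrete_frechet(path_a, path_b):
    """Discrete Frechet distance between two waypoint sequences.

    Assumption: each path is a non-empty list of (x, y) tuples. This is the
    textbook dynamic program, not necessarily the exact variant used in BehAV.
    """
    n, m = len(path_a), len(path_b)
    if n == 0 or m == 0:
        raise ValueError("both paths need at least one waypoint")

    # ca[i][j] = coupling distance for the prefixes path_a[:i+1], path_b[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(path_a[i], path_b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along either path or both, keep the cheapest option.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Hypothetical example: a teleoperated path and a robot path offset by 1 m.
teleop = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
robot = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(teleop, robot))  # 1.0
```

A lower value means the robot's trajectory stays closer to the human-teleoperated one along its whole length, which is the sense in which the abstract reports improved alignment.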

Paper Structure

This paper contains 24 sections, 12 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Autonomous robot navigation with BehAV (ours) for two different behavioral instructions compared to the human preferred path. BehAV decomposes human instructions into behavioral and navigation instructions. The behavioral instructions are used to construct a real-time behavioral cost map to encode the behavioral rules for planning.
  • Figure 2: Overall architecture of BehAV: We decompose human instructions into navigation and behavioral components. Navigation instructions identify landmarks and goals. Behavioral instructions are split into actions ($\mathcal{A}^{\text{behav}}$) and objects ($\mathcal{L}^{\text{behav}}$). An LLM evaluates action desirability, assigning probabilities to each action. A lightweight vision-language model (CLIPSeg; Lüddecke and Ecker, 2022) generates real-time segmentation maps for behavioral objects. Combining action probabilities with segmentation maps yields a real-time behavioral cost map encoding the instructions. A local planner uses this cost map to navigate toward landmarks while respecting behavioral constraints.
  • Figure 3: Cost maps generated by BehAV for diverse instructions: (a) "Follow the sidewalk, stay away from grass, and avoid cyclists"; (b) "Stay on the sand, stay away from grass, and avoid water puddles"; (c) "Stay on the sidewalk, follow the crosswalk and stop for stop hand gesture"; (d) "Stay on concrete, avoid grass and stop for stop sign"; (e) "Stay on tiles, and use caution to follow stairs"; (f) "Follow the concrete, stay away from grass, and stop for people wearing red shirts"; (g) "Stay on concrete, stay away from grass and yield to people wearing black shirts"; (h) "Stay on concrete, stay away from the grass"; (i) "Stay on grass, stay away from the concrete". The color map is shown on the right side.
  • Figure 4: Landmark goal detection using various VLM models compared to the ground truth centroid (blue) across diverse scenes. Predictions from GPT4o (green), GPT4v (red), and Gemini (cyan) are shown, with orange bounding boxes highlighting the landmarks.
  • Figure 5: Robot trajectories when navigating in diverse outdoor scenes using various behavioral instructions. BehAV can demonstrate diverse behaviors by simply changing the input instructions as desired.
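
The Figure 2 caption describes combining per-action probabilities with per-object segmentation maps to obtain a behavioral cost map. The exact combination rule is not given in this excerpt, so the sketch below is only one plausible reading: each pixel takes the cost of the behavioral object with the highest segmentation confidence, and pixels where no object is confidently detected get a neutral cost. The function `behavioral_cost_map`, the parameters `neutral_cost` and `min_conf`, and the rule `cost = 1 - desirability` are all assumptions made for illustration. In the paper the segmentation maps come from CLIPSeg; here they are passed in as plain arrays.

```python
import numpy as np


def behavioral_cost_map(seg_maps, desirability, neutral_cost=0.5, min_conf=0.3):
    """Combine segmentation maps and action desirability into one cost map.

    seg_maps:     dict name -> (H, W) array of segmentation confidences in [0, 1]
    desirability: dict name -> probability in [0, 1] that being on or near the
                  object is desirable (e.g. high for "stay on sidewalk")
    Assumption: cost = 1 - desirability, winner-takes-all per pixel.
    """
    names = list(seg_maps)
    stack = np.stack([np.asarray(seg_maps[n], dtype=float) for n in names])
    costs = np.array([1.0 - desirability[n] for n in names])

    best = stack.argmax(axis=0)   # index of the most confident object per pixel
    conf = stack.max(axis=0)      # that object's confidence per pixel

    cost = costs[best]            # (H, W) cost of the winning object
    cost[conf < min_conf] = neutral_cost  # no confident detection: neutral
    return cost


# Hypothetical 2x2 example for "stay on the sidewalk, stay away from grass".
seg = {
    "sidewalk": [[0.9, 0.1], [0.2, 0.1]],
    "grass": [[0.05, 0.8], [0.1, 0.2]],
}
des = {"sidewalk": 0.9, "grass": 0.1}
print(behavioral_cost_map(seg, des))
```

In this toy example the sidewalk pixel receives a low cost, the grass pixel a high cost, and the two low-confidence pixels fall back to the neutral value. A planner could then sum such costs along candidate trajectories together with occupancy costs, as the abstract outlines.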