Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs

Faraz Lotfi, Farnoosh Faraji, Nikhil Kakodkar, Travis Manderson, David Meger, Gregory Dudek

TL;DR

This work tackles map-free off-road navigation by leveraging high-level verbal instructions parsed by Whisper and an LLM to identify landmarks, preferred terrains, and adverbs that map to speed constraints. It introduces a language-driven semantic segmentation approach (LSeg) to ground terrain and landmark information without prior maps, feeding into a nonlinear MPC-based local planner with adverb constraints, and uses a moving horizon estimator for robust state estimation. Empirical results show LSeg outperforms ConceptFusion on larger segmentation regions, with RC-car and Unreal Engine experiments validating the approach. Ablation studies demonstrate that including adverbs and terrain preferences significantly reduces navigation failures, highlighting the practical potential of data-light, instruction-driven constrained navigation in diverse environments.

Abstract

This paper explores leveraging large language models for map-free off-road navigation using generative AI, reducing the need for traditional data collection and annotation. We propose a method where a robot receives verbal instructions, converted to text through Whisper, and a large language model (LLM) extracts landmarks, preferred terrains, and crucial adverbs that are translated into speed settings for constrained navigation. A language-driven semantic segmentation model generates text-based masks for identifying landmarks and terrain types in images. By translating 2D image points to the vehicle's motion plane using camera parameters, an MPC controller guides the vehicle towards the desired terrain. This approach enhances adaptation to diverse environments and facilitates the use of high-level instructions for navigating complex and challenging terrains.
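
The abstract's step of translating 2D image points to the vehicle's motion plane can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an ideal pinhole camera (intrinsics `fx`, `fy`, `cx`, `cy`), a flat ground plane, and a camera mounted at a known height and pitched down by a known angle. The function name `pixel_to_ground` and all parameter names are hypothetical.

```python
import math

def pixel_to_ground(u, v, fx, fy, cx, cy, cam_height, pitch_down):
    """Back-project pixel (u, v) onto a flat ground plane.

    Assumptions (not from the paper): pinhole camera without distortion,
    flat ground, camera cam_height metres above it, pitched down by
    pitch_down radians. Returns (x_forward, y_left) in metres relative to
    the point directly below the camera, or None if the pixel's ray does
    not hit the ground (at or above the horizon).
    """
    # Ray through the pixel in the camera optical frame (x right, y down, z forward).
    xc = (u - cx) / fx
    yc = (v - cy) / fy
    zc = 1.0
    # Re-express the ray in vehicle axes for a level camera (x forward, y left, z up).
    xv, yv, zv = zc, -xc, -yc
    # Apply the downward pitch as a rotation about the vehicle's y axis.
    c, s = math.cos(pitch_down), math.sin(pitch_down)
    dx = c * xv + s * zv
    dy = yv
    dz = -s * xv + c * zv
    # The ray must point downward to intersect the ground plane z = 0.
    if dz >= -1e-9:
        return None
    t = cam_height / -dz  # scale at which the ray reaches the ground
    return (t * dx, t * dy)

# Example with made-up intrinsics: a pixel below the image centre, level camera 1 m high.
point = pixel_to_ground(320, 340, 500.0, 500.0, 320.0, 240.0, 1.0, 0.0)
print(point)
```

Under these assumptions, the ground coordinates of pixels inside a preferred-terrain mask could serve as candidate targets for a local planner. A real system would also need lens-distortion correction and a calibrated camera-to-vehicle transform.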

Paper Structure (6 sections, 10 figures)

This paper contains 6 sections, 10 figures.

Figures (10)

  • Figure 1: The intended off-road vehicle
  • Figure 2: The overall block diagram of the proposed navigation system
  • Figure 3: Engineering an effective prompt to elicit the desired output from a Large Language Model (LLM).
  • Figure 4: Performance of two different models on an unseen dataset of off-road scenes. The left plot shows the distribution of samples by the percentage of the image covered by the ground-truth segment, while the right plot shows the results for the Dice metric. LSeg performs better with larger regions of interest, while ConceptFusion appears to struggle in this particular application.
  • Figure 5: These plots depict the results obtained from evaluating the models on a dataset that features more samples with larger true segments. Crucially, this dataset was acquired using a front-facing camera mounted on our RC car, navigating through challenging off-road terrain.
  • ...and 5 more figures
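
The two quantities named in the Figure 4 caption, the Dice metric and the percentage of the image covered by the ground-truth segment, can be computed as in the sketch below. It uses the standard Dice definition, 2|A∩B| / (|A| + |B|), on binary masks stored as nested lists. The function names are hypothetical, and returning 1.0 for two empty masks is a convention chosen here, not something stated in the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks of equal shape."""
    intersection = pred_count = truth_count = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_count += bool(p)
            truth_count += bool(t)
            intersection += bool(p) and bool(t)
    total = pred_count + truth_count
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def coverage_percent(truth):
    """Percentage of image pixels that belong to the ground-truth segment."""
    n_pixels = sum(len(row) for row in truth)
    n_segment = sum(bool(t) for row in truth for t in row)
    return 100.0 * n_segment / n_pixels

# Toy 2x3 masks for illustration only.
pred = [[1, 1, 0],
        [0, 0, 0]]
truth = [[1, 0, 0],
         [1, 0, 0]]
print(dice_score(pred, truth), coverage_percent(truth))
```

Binning samples by `coverage_percent` and reporting `dice_score` per bin is one way to reproduce the kind of comparison the caption describes, where performance is related to the size of the region of interest.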