Words to Wheels: Vision-Based Autonomous Driving Understanding Human Language Instructions Using Foundation Models

Chanhoe Ryu, Hyunki Seong, Daegyu Lee, Seongwoo Moon, Sungjae Min, D. Hyunchul Shim

TL;DR

This paper introduces an innovative application of foundation models, enabling Unmanned Ground Vehicles (UGVs) equipped with an RGB-D camera to navigate to designated destinations based on human language instructions, thus facilitating generalization to novel environments.

Abstract

This paper introduces an innovative application of foundation models, enabling Unmanned Ground Vehicles (UGVs) equipped with an RGB-D camera to navigate to designated destinations based on human language instructions. Unlike learning-based methods, this approach does not require prior training but instead leverages existing foundation models, thus facilitating generalization to novel environments. Upon receiving human language instructions, a large language model (LLM) transforms them into a 'cognitive route description', a detailed navigation route expressed in human language. The vehicle then decomposes this description into landmarks and navigation maneuvers. The vehicle also determines elevation costs and identifies the navigability levels of different regions through a terrain segmentation model, GANav, trained on open datasets. Semantic elevation costs, which take both elevation and navigability levels into account, are estimated and provided to the Model Predictive Path Integral (MPPI) planner, which is responsible for local path planning. Concurrently, the vehicle searches for target landmarks using foundation models, including YOLO-World and EfficientViT-SAM. Ultimately, the vehicle executes the navigation commands to reach the designated destination, which is the final landmark. Our experiments demonstrate that this application successfully guides UGVs to their destinations following human language instructions in novel environments, such as unfamiliar terrain or urban settings.
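
The abstract states that the semantic elevation cost accounts for both elevation and navigability level, but it does not give the formula. The following is a minimal sketch of one plausible way to combine the two, assuming a weighted sum on a 2D grid. The function name, the weights `w_elev` and `w_sem`, the elevation cap `max_elev`, and the convention that level 0 is the most navigable class are all hypothetical choices for illustration, not the authors' method.

```python
from typing import List

Grid = List[List[float]]


def semantic_elevation_cost(
    elevation: Grid,
    navigability: List[List[int]],
    max_level: int = 3,      # assumed: highest (least navigable) level
    w_elev: float = 0.5,     # assumed weight of the elevation term
    w_sem: float = 0.5,      # assumed weight of the semantic term
    max_elev: float = 0.3,   # assumed elevation (m) at which cost saturates
) -> Grid:
    """Combine per-cell elevation and navigability level into a cost in [0, 1]."""
    cost: Grid = []
    for elev_row, nav_row in zip(elevation, navigability):
        row = []
        for z, level in zip(elev_row, nav_row):
            # Elevation term: grows linearly with |z| and saturates at 1.
            elev_cost = min(abs(z) / max_elev, 1.0)
            # Semantic term: navigability level clamped and scaled to [0, 1].
            sem_cost = min(max(level, 0), max_level) / max_level
            row.append(w_elev * elev_cost + w_sem * sem_cost)
        cost.append(row)
    return cost


if __name__ == "__main__":
    elevation = [[0.0, 0.3], [0.15, 0.6]]
    navigability = [[0, 0], [3, 3]]
    for row in semantic_elevation_cost(elevation, navigability):
        print(row)
```

A grid like this would then be handed to the local planner as its running cost. A weighted sum is only one option; taking the maximum of the two terms would be a more conservative alternative.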

Paper Structure

This paper contains 25 sections, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of Words to Wheels.
  • Figure 2: The overall pipeline for vision-based autonomous driving, which interprets human language instructions using foundation models, encompasses several crucial steps. Initially, unstructured human language instructions are translated into a cognitive route description, which is then parsed into a set of maneuvers and landmarks. Depth images are utilized to generate point clouds, while RGB images are processed for navigability level segmentation using GANav. These elements contribute to creating semantic images, which, combined with point clouds, help to produce a semantic elevation cost map accounting for both semantic and elevation costs. This cost map is processed by a Model Predictive Path Integral (MPPI) planner for local planning, resulting in the output of control commands. The maneuvers are refined using the semantic elevation cost map to produce the desired actions. As the vehicle progresses, landmarks are detected and segmented using YOLO-World and EfficientViT-SAM, enabling the vehicle to autonomously verify its arrival at each landmark, seek the next one, and ultimately reach the final destination.
  • Figure 3: (a) Paragraphs extracted from 'The description of routes: a cognitive approach to the production of spatial discourse' illustrate the steps to successfully describe routes (Denis, 1997, pp. 420-421). (b) Visualization of cognitive route description.
  • Figure 4: (a) Visualization of the Gaussian-inspired penalty used to induce a left-turn maneuver. (b) Updated cost map that promotes a left turn by penalizing the right side of the cost map.
  • Figure 5: Hardware design of an unmanned ground vehicle (32"$\times$20"$\times$16") on a Traxxas buggy car platform, including: (a) Nvidia Jetson Orin NX, (b) Nvidia Jetson Orin AGX, (c) NETGEAR GS108E, (d) Intel NUC 11 Pro i7, (e) Intel RealSense D455, (f) Teensy 4.1 Development Board, (g) Futaba T18SZ
  • ...and 2 more figures
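
The caption of Figure 4 describes a Gaussian-inspired penalty that promotes a left turn by penalizing the right side of the cost map. The sketch below illustrates that idea under stated assumptions: the cost map is a 2D grid whose column index increases to the right, and the penalty is a Gaussian over columns centered on the edge opposite the desired turn. The function name and the parameters `amplitude` and `sigma_frac` are hypothetical; the exact form of the penalty is not given in the text above.

```python
import math
from typing import List

Grid = List[List[float]]


def apply_turn_penalty(
    cost_map: Grid,
    maneuver: str,
    amplitude: float = 1.0,    # assumed peak value of the penalty
    sigma_frac: float = 0.25,  # assumed Gaussian width as a fraction of map width
) -> Grid:
    """Return a new cost map with a Gaussian penalty on the side opposite the turn."""
    if maneuver not in ("left", "right"):
        raise ValueError("maneuver must be 'left' or 'right'")
    if not cost_map or not cost_map[0]:
        return []
    n_cols = len(cost_map[0])
    # To encourage a left turn, penalize the right edge, and vice versa.
    center = n_cols - 1 if maneuver == "left" else 0
    sigma = max(sigma_frac * n_cols, 1e-6)
    penalty = [
        amplitude * math.exp(-0.5 * ((c - center) / sigma) ** 2)
        for c in range(n_cols)
    ]
    # Add the same column-wise penalty to every row; the input is not modified.
    return [[v + p for v, p in zip(row, penalty)] for row in cost_map]


if __name__ == "__main__":
    flat = [[0.0] * 5 for _ in range(2)]
    for row in apply_turn_penalty(flat, "left"):
        print([round(v, 3) for v in row])
```

On a flat map, the penalized cost rises smoothly toward the right edge for a left turn, so a sampling-based planner that minimizes cost would tend to favor trajectories bending left.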