Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs

Abdelmoamen Nasser, Yousef Baba'a, Murad Mebrahtu, Nadya Abdel Madjid, Jorge Dias, Majid Khonji

Abstract

Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Using several models requires training each component separately, curating task-specific datasets, and fine-tuning. In this work, we present a zero-shot approach that leverages SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach passes the VLM both the original image and the segmented image annotated with a numeric label for each mask. The VLM is then prompted to identify which regions, referenced by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models and relies instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high-resolution segmentation datasets and enables full-stack navigation in our Isaac Sim off-road environment.
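
To make the pipeline concrete, the following is a minimal sketch of the visual-prompting loop described above, assuming the masks come from a SAM2-style automatic mask generator as boolean arrays and the VLM is reached through an OpenAI-compatible chat API. The helper names (label_masks, to_data_url, ask_drivable_regions) and the prompt wording are illustrative, not the paper's exact implementation.

```python
# Hedged sketch of the zero-shot visual-prompting pipeline (not the authors' code).
# Assumptions: masks are boolean 2D arrays from a SAM2-style generator; the VLM
# is any multimodal chat model behind an OpenAI-compatible API.
import base64
import json
import cv2
import numpy as np
from openai import OpenAI

def label_masks(image: np.ndarray, masks: list[np.ndarray]) -> np.ndarray:
    """Overlay each mask's numeric ID at its centroid on a copy of the image."""
    annotated = image.copy()
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        cx, cy = int(xs.mean()), int(ys.mean())
        cv2.putText(annotated, str(idx), (cx, cy),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.2, (255, 255, 255), 3)
    return annotated

def to_data_url(image: np.ndarray) -> str:
    """Encode a BGR image as a base64 PNG data URL for the chat API."""
    ok, buf = cv2.imencode(".png", image)
    return "data:image/png;base64," + base64.b64encode(buf.tobytes()).decode()

def ask_drivable_regions(client: OpenAI, original: np.ndarray,
                         annotated: np.ndarray) -> list[int]:
    """Send both images and return the numeric labels the VLM deems drivable."""
    prompt = ("The second image shows the first image segmented into regions, "
              "each marked with a numeric label. For an off-road vehicle, "
              "which labels correspond to drivable terrain? "
              "Answer with a JSON list of integers only.")
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any multimodal chat model works here
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(original)}},
            {"type": "image_url", "image_url": {"url": to_data_url(annotated)}},
        ]}],
    )
    # Assumes the model complies and returns a bare JSON list, e.g. "[1, 3, 4]".
    return json.loads(resp.choices[0].message.content)
```

Downstream, the returned label set would select which SAM2 masks are merged into a single drivable-area mask that feeds the mapping, planning, and control modules.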

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Overview of the off-road framework: (i) Simulation Module using NVIDIA Isaac Sim, (ii) Segmentation Module with SAM2 for mask generation and tracking, (iii) VLM Inference Module to detect the drivable area based on segmented masks and textual prompts, (iv) Mapping, planning, and control modules for environment mapping, path generation, and path following with velocity tracking, respectively.
  • Figure 2: Simulation Setup: Bottom left: Simulation environment created in Unreal Engine. Top left: Simulation environment imported into Isaac Sim. Center: Top view of the open trail created using the spline creation tools, where the letters (A, B, and C) indicate the goals in the reachability test. Right: Polaris RZR Sport 2022 imported using the PhysX vehicle API.
  • Figure 3: The pre-processing interface at three states from left to right. (i) The initial state. (ii) After addition of drivable masks 4, 5, and 6. (iii) After subtraction of masks 6 and 7. (A minimal sketch of these add/subtract operations follows this list.)
  • Figure 4: Qualitative comparison of Mask Generation vs Point Prompting on frame 88.
  • Figure 5: Samples of drivable areas detected by the evaluated VLMs. GT refers to Ground Truth, while the remaining columns display the outputs of each model: ChatGPT-4o, Aquila, Ivy-VL, and MiniCPM.
  • ...and 1 more figure
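
The add and subtract operations shown in Figure 3 reduce to boolean set operations on the per-region masks. The sketch below shows one plausible form, assuming the masks are boolean arrays of equal shape; the function names are hypothetical.

```python
# Hypothetical sketch of Figure 3's add/subtract operations, assuming
# boolean masks of equal shape (not the paper's actual interface code).
import numpy as np

def add_mask(drivable: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mark a region as drivable (set union)."""
    return np.logical_or(drivable, mask)

def subtract_mask(drivable: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Remove a region from the drivable set (set difference)."""
    return np.logical_and(drivable, np.logical_not(mask))
```

For the states in Figure 3, the interface would apply add_mask for regions 4, 5, and 6 and then subtract_mask for regions 6 and 7, leaving the final drivable-area mask.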