Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models

Zhen Zhang, Anran Lin, Chun Wai Wong, Xiangyu Chu, Qi Dou, K. W. Samuel Au

TL;DR

This work tackles interactive robot navigation in environments containing traversable obstacles under natural-language instructions. It combines GPT-3.5 as the large language model and Grounding DINO as the vision-language model to extract landmarks and their action-aware attributes $P_{a|\ell}$, then builds an action-aware costmap from LiDAR point clouds for $A^*$ path planning, all without fine-tuning. The key contributions are action-aware landmarks and a semantic, layered costmap that fuses language-derived semantics with sensor data. The approach is validated on curtains, grass, and a medical-ward scenario to demonstrate generalization and fast deployment. The results indicate improved navigation flexibility and real-time responsiveness in cluttered environments, enhancing human-robot interaction in healthcare and service robotics.

Abstract

This paper proposes an interactive navigation framework that uses large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize a large language model (GPT-3.5) and an open-set vision-language model (Grounding DINO) to create an action-aware costmap for effective path planning without fine-tuning. With the large models, we achieve an end-to-end system that goes from textual instructions such as "Can you pass through the curtains to deliver medicines to me?" to bounding boxes (e.g., curtains) with action-aware attributes. These are used to segment LiDAR point clouds into traversable and untraversable parts, and an action-aware costmap is then constructed for generating a feasible path. The pre-trained large models generalize well and require no additional annotated training data, allowing fast deployment in interactive navigation tasks. For verification, we instruct the robot to traverse multiple traversable objects such as curtains and grass. We also test curtain traversal in a medical scenario. All experimental results demonstrate the proposed framework's effectiveness and adaptability to diverse environments.
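The segmentation and costmap steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each LiDAR point has already been projected into the camera image (pixel coordinates `u`, `v`), that a point counts as traversable only if it falls inside a traversable landmark's box and outside every untraversable box, and that the costmap is a plain binary grid. The names `Landmark`, `split_points`, and `build_costmap`, and all numeric values, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Landmark:
    name: str
    traversable: int  # 1 = may be passed through, 0 = must be avoided
    box: Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def in_box(u: float, v: float, box: Tuple[float, float, float, float]) -> bool:
    u_min, v_min, u_max, v_max = box
    return u_min <= u <= u_max and v_min <= v <= v_max


def split_points(points, landmarks: List[Landmark]):
    """points: iterable of (x, y, u, v), i.e. ground-plane position in metres
    plus the point's (assumed, precomputed) pixel projection.
    Returns (traversable, untraversable) lists of (x, y)."""
    traversable, untraversable = [], []
    for x, y, u, v in points:
        in_free = any(l.traversable == 1 and in_box(u, v, l.box) for l in landmarks)
        in_blocked = any(l.traversable == 0 and in_box(u, v, l.box) for l in landmarks)
        # Overlap regions are treated conservatively as untraversable.
        if in_free and not in_blocked:
            traversable.append((x, y))
        else:
            untraversable.append((x, y))
    return traversable, untraversable


def build_costmap(untraversable, width: int, height: int, resolution: float):
    """Binary grid: 1 marks a cell containing at least one untraversable point."""
    grid = [[0] * width for _ in range(height)]
    for x, y in untraversable:
        col, row = math.floor(x / resolution), math.floor(y / resolution)
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = 1
    return grid


# Toy data (illustrative values only).
landmarks = [
    Landmark("curtain", 1, (100, 50, 300, 400)),
    Landmark("chair", 0, (250, 200, 350, 400)),
]
points = [
    (1.0, 0.2, 150, 100),  # inside the curtain box only
    (1.0, 0.7, 280, 300),  # inside both the curtain and chair boxes
    (2.0, 1.5, 500, 100),  # inside no box: ordinary obstacle
]
free_pts, blocked_pts = split_points(points, landmarks)
costmap = build_costmap(blocked_pts, width=5, height=4, resolution=0.5)
```

Only the untraversable points are written into the grid, so cells occupied solely by curtain points stay free for the planner.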

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, and 1 algorithm.

Figures (8)

  • Figure 1: An example of interactive navigation. When a medicine-delivery robotic dog arrives at a room, a patient can further interact with it and ask it to walk to the bed by passing through the traversable curtains. Such an interaction can help navigate the robot to places it cannot normally reach and therefore meet humans' real-time needs.
  • Figure 2: Proposed large-model-based interactive navigation framework for robots in environments with traversable obstacles. (a) The large model module uses an LLM and a VLM to extract the landmarks' bounding boxes and action-aware attributes from text or speech. (b) The action-aware costmap is constructed from the point clouds, segmented according to the output of the large model module. (c) A feasible path can be planned with the action-aware costmap, whereas no feasible path is found if the landmarks' action-aware attributes are ignored.
  • Figure 3: Examples of the output of the large model module. The initial instruction for these examples is: “Please go through the curtain and be careful of the chair.” Using GPT-3.5, we obtain the landmarks and their corresponding traversable attributes: curtain and 1 (in orange); chair and 0 (in red). We then ground the related bounding boxes with Grounding DINO (Liu et al., 2023).
  • Figure 4: (a) The experimental scenario in bird's-eye view and the point clouds obtained from the LiDAR mapping sensor. (b) The yellow areas are the intersections between the bounding boxes of untraversable and traversable objects. (c) The point clouds colored red are the traversable part belonging to the curtain. The remaining points are untraversable and belong to part of the curtain (➊), the chairs (➋, ➌), and the environment.
  • Figure 5: (a) We test our interactive navigation system in a simulated indoor hospital ward environment with static obstacles. (b-c) The simulated legged robot can obtain the RGB image and point clouds, respectively.
  • ...and 3 more figures
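
The caption of Figure 2(c) states that a feasible path exists only when the landmarks' action-aware attributes are taken into account. The toy Python sketch below illustrates that effect under stated assumptions: a small binary grid, a 4-connected $A^*$ search with a Manhattan heuristic, and a hypothetical set `curtain_cells` marking where the curtain falls in the grid. The grid layout and all names are illustrative and are not taken from the paper.

```python
import heapq


def astar(grid, start, goal):
    """4-connected A* with a Manhattan heuristic.
    grid[r][c] == 1 means blocked. Returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None


# A curtain spans column 2 and separates the start from the goal.
plain_costmap = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
curtain_cells = {(0, 2), (1, 2), (2, 2)}  # hypothetical traversable landmark

# Action-aware costmap: cells occupied only by the traversable curtain are freed.
action_aware_costmap = [
    [0 if (r, c) in curtain_cells else v for c, v in enumerate(row)]
    for r, row in enumerate(plain_costmap)
]

start, goal = (1, 0), (1, 4)
path_plain = astar(plain_costmap, start, goal)
path_aware = astar(action_aware_costmap, start, goal)
```

With the plain costmap the search exhausts the left half of the grid and returns `None`; with the action-aware costmap it returns the straight path through the curtain.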