Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models
Zhen Zhang, Anran Lin, Chun Wai Wong, Xiangyu Chu, Qi Dou, K. W. Samuel Au
TL;DR
This work tackles interactive robot navigation, under natural-language instructions, in environments containing traversable obstacles. It combines a large language model (GPT-3.5) and an open-set vision-language model (Grounding DINO) to extract landmarks and their action-aware attributes $P_{a|\ell}$, then builds an action-aware costmap from LiDAR data for $A^*$ path planning, without fine-tuning either model. Key contributions include action-aware landmarks and a semantic, layered costmap that fuses language-derived semantics with sensor data. The approach is validated on curtains, grass, and a medical-ward scenario to demonstrate generalization and fast deployment. The results indicate improved navigation flexibility and real-time responsiveness in cluttered environments, which could benefit human-robot interaction in healthcare and service robotics.
Abstract
This paper proposes an interactive navigation framework that uses large language and vision-language models to let robots navigate in environments with traversable obstacles. We use a large language model (GPT-3.5) and an open-set vision-language model (Grounding DINO) to create an action-aware costmap for effective path planning without fine-tuning. With these large models, we achieve an end-to-end system that goes from textual instructions such as "Can you pass through the curtains to deliver medicines to me?" to bounding boxes (e.g., of curtains) with action-aware attributes. These bounding boxes are used to segment LiDAR point clouds into traversable and untraversable parts, from which an action-aware costmap is constructed to generate a feasible path. The pre-trained large models generalize well and require no additional annotated training data, allowing fast deployment in interactive navigation tasks. For verification, we instruct the robot to traverse several kinds of traversable objects, such as curtains and grass. We also test curtain traversal in a medical scenario. All experimental results demonstrate the proposed framework's effectiveness and adaptability to diverse environments.
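The costmap idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a bounding box of a traversable landmark has already been converted into a bearing interval in the robot frame, that the map is a small 2D grid with 4-connected moves, and that the cost values are arbitrary. The names `split_points`, `build_costmap`, `astar`, and `TRAVERSABLE_COST` are hypothetical and chosen only for illustration.

```python
import heapq
import math

LETHAL = math.inf        # untraversable obstacle cell
FREE_COST = 1.0          # cost of entering a free cell
TRAVERSABLE_COST = 5.0   # assumed cost of pushing through, e.g., a curtain


def split_points(points, bearing_range):
    """Split 2D LiDAR points (x, y) into traversable and untraversable sets.

    bearing_range is an assumed (lo, hi) interval in radians covered by the
    bounding box of a landmark the instruction marked as traversable.
    """
    lo, hi = bearing_range
    traversable, untraversable = [], []
    for x, y in points:
        bearing = math.atan2(y, x)
        (traversable if lo <= bearing <= hi else untraversable).append((x, y))
    return traversable, untraversable


def build_costmap(width, height, traversable, untraversable, resolution=1.0):
    """Rasterize both point sets into a grid indexed as grid[row][col]."""
    grid = [[FREE_COST] * width for _ in range(height)]
    for pts, cost in ((traversable, TRAVERSABLE_COST), (untraversable, LETHAL)):
        for x, y in pts:
            col, row = int(x / resolution), int(y / resolution)
            if 0 <= row < height and 0 <= col < width:
                grid[row][col] = max(grid[row][col], cost)
    return grid


def astar(grid, start, goal):
    """A* over (row, col) cells; the cost of a move is the entered cell's cost."""
    height, width = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance is admissible because every move costs >= FREE_COST.
        return FREE_COST * (abs(cell[0] - goal[0]) + abs(cell[1] - goal[1]))

    open_heap = [(h(start), 0.0, start)]
    best = {start: 0.0}
    parent = {}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1], g
        if g > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < height and 0 <= nc < width):
                continue
            step = grid[nr][nc]
            if step == LETHAL:
                continue
            ng = g + step
            if ng < best.get(nxt, math.inf):
                best[nxt] = ng
                parent[nxt] = cell
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None, math.inf


# A wall of LiDAR returns along x = 2.5; one return lies inside the curtain box.
wall = [(2.5, y + 0.5) for y in range(5)]
curtain, solid = split_points(wall, (0.7, 0.85))
path, cost = astar(build_costmap(5, 5, curtain, solid), (0, 0), (0, 4))
```

In this toy setup the wall separates the start from the goal, so a path exists only because the cell inside the assumed curtain interval receives a finite cost instead of a lethal one. Treating every return as untraversable leaves the planner with no path.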
