Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation
Yu Li, Dayou Li, Chenkun Zhao, Ruifeng Wang, Ran Song, Wei Zhang
TL;DR
This work addresses language-driven robotic goal navigation by constructing a Visual Scene Representation (VSR) from large pre-trained models. It combines SAM-based segmentation and CLIP embeddings to build object-level features with coordinates, then uses an active coverage path and a Hamiltonian-path formulation to autonomously collect a scene map. An LLM translates natural-language instructions into atomic tasks that operate on the VSR via well-defined APIs, enabling open-vocabulary object querying and manipulation. Experiments on a real RM65 platform show strong open-vocabulary retrieval and instruction-following performance, highlighting the approach's potential for robust, language-guided robotic assistance.
Abstract
To navigate to a goal object and fetch it, a robot must understand both the instructions it is given and its surrounding environment. Large pre-trained models have shown that they can interpret tasks defined through language descriptions. However, previous methods that integrate large pre-trained models with daily tasks fall short on many robotic goal navigation tasks because they understand the environment poorly. In this work, we present a visual scene representation built with large-scale visual language models that forms a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method parses language instructions into action sequences for a robot to follow and accomplishes goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.
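As a rough illustration of what querying such a scene representation could look like, the minimal sketch below stores one feature vector and one coordinate per object and returns the object whose feature is most similar to a query feature. It is not the authors' implementation: the names `SceneObject`, `cosine`, and `query_scene` are hypothetical, and the short hand-written vectors are stand-ins for the image and text embeddings that a model such as CLIP would produce.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SceneObject:
    """One entry of a toy scene representation (hypothetical structure)."""
    label: str                            # human-readable tag, used only for inspection
    embedding: List[float]                # stand-in for an object-level visual feature
    position: Tuple[float, float, float]  # object coordinates in the map frame


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_scene(scene: Sequence[SceneObject],
                text_embedding: Sequence[float]) -> SceneObject:
    """Return the stored object whose feature best matches the query feature."""
    if not scene:
        raise ValueError("scene representation is empty")
    return max(scene, key=lambda obj: cosine(obj.embedding, text_embedding))


# Toy data: three objects with made-up 3-D features and coordinates.
scene = [
    SceneObject("mug", [0.9, 0.1, 0.0], (1.2, 0.4, 0.8)),
    SceneObject("book", [0.1, 0.9, 0.1], (0.3, 2.0, 0.7)),
    SceneObject("bottle", [0.7, 0.0, 0.7], (2.5, 1.1, 0.9)),
]

# Stand-in for the embedding of a language query such as "something to drink coffee from".
query = [1.0, 0.2, 0.1]

goal = query_scene(scene, query)
print(goal.label, goal.position)
```

In this sketch the returned coordinates would serve as the navigation goal; a real system would replace the toy vectors with features from the pre-trained models and would likely need a similarity threshold to reject queries that match nothing in the scene.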
