Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation

Yu Li, Dayou Li, Chenkun Zhao, Ruifeng Wang, Ran Song, Wei Zhang

TL;DR

This work addresses language-driven robotic goal navigation by constructing a Visual Scene Representation (VSR) from large pre-trained models. It combines SAM-based segmentation and CLIP embeddings to build object-level features with coordinates, then uses an active coverage path and a Hamiltonian-path formulation to autonomously collect a scene map. An LLM translates natural-language instructions into atomic tasks that operate on the VSR via well-defined APIs, enabling open-vocabulary object querying and manipulation. Experiments on a real RM65 platform show strong open-vocabulary retrieval and instruction-following performance, highlighting the approach's potential for robust, language-guided robotic assistance.
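The open-vocabulary lookup described above can be illustrated with a minimal sketch. It assumes each map entry stores one feature vector per segmented object together with its coordinates, and that a text query is embedded into the same space. The `SceneObject` class, the `query_scene` function, and the three-dimensional toy vectors are hypothetical stand-ins for illustration; in the actual system the features would come from CLIP applied to SAM segments, and the paper's own data structures may differ.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    label: str                     # human-readable tag, used here only for inspection
    feature: List[float]           # toy stand-in for a CLIP embedding of one segment
    position: Tuple[float, float]  # assumed (x, y) map coordinates


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_scene(objects: List[SceneObject], text_feature: List[float]) -> SceneObject:
    """Return the stored object whose feature is most similar to the text feature."""
    return max(objects, key=lambda obj: cosine(obj.feature, text_feature))


# Hand-made toy scene; real features would be high-dimensional embeddings.
scene = [
    SceneObject("coke can", [0.9, 0.1, 0.0], (1.2, 0.5)),
    SceneObject("dustbin", [0.1, 0.9, 0.1], (3.0, 2.0)),
    SceneObject("elevator", [0.0, 0.2, 0.9], (6.5, 0.0)),
]

# Toy stand-in for the embedding of a text query such as "a trash bin".
target = query_scene(scene, [0.2, 0.8, 0.0])
print(target.label, target.position)  # dustbin (3.0, 2.0)
```

Because the match is a nearest-neighbour search in embedding space rather than a lookup in a fixed label set, any phrase the text encoder can embed becomes a valid query, which is what makes the retrieval open-vocabulary.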

Abstract

To complete a complex task in which a robot navigates to a goal object and fetches it, the robot needs a good understanding of both the instructions and the surrounding environment. Large pre-trained models have shown the ability to interpret tasks defined via language descriptions. However, previous methods that integrate large pre-trained models with daily tasks fall short in many robotic goal navigation tasks because they understand the environment poorly. In this work, we present a visual scene representation built with large-scale vision-language models that forms a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow and accomplish goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.
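The step from a parsed instruction to robot actions can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the API names `navigate_to`, `pick`, and `place`, the `SCENE` dictionary, and the hand-written `plan` are all hypothetical. In the described system the plan would be produced by an LLM and the coordinates would come from querying the scene representation; the paper's actual API set is not given in this summary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the scene representation: object name -> (x, y).
SCENE: Dict[str, Tuple[float, float]] = {
    "coke can": (1.2, 0.5),
    "dustbin": (3.0, 2.0),
}

# Executed actions are recorded here instead of being sent to a real robot.
log: List[str] = []


def navigate_to(target: str) -> None:
    """Look up the target's coordinates and record a navigation action."""
    x, y = SCENE[target]
    log.append(f"navigate_to {target} at ({x}, {y})")


def pick(target: str) -> None:
    log.append(f"pick {target}")


def place(target: str) -> None:
    log.append(f"place into {target}")


# Registry of atomic tasks the planner is allowed to call.
API: Dict[str, Callable[[str], None]] = {
    "navigate_to": navigate_to,
    "pick": pick,
    "place": place,
}

# Hand-written stand-in for LLM output for "throw the coke can into the dustbin".
plan = [
    ("navigate_to", "coke can"),
    ("pick", "coke can"),
    ("navigate_to", "dustbin"),
    ("place", "dustbin"),
]


def execute(steps: List[Tuple[str, str]]) -> None:
    """Run each (api_name, argument) pair, rejecting names outside the registry."""
    for name, arg in steps:
        if name not in API:
            raise ValueError(f"unknown atomic task: {name}")
        API[name](arg)


execute(plan)
print("\n".join(log))
```

Restricting execution to a fixed registry means a malformed or invented step in the generated plan fails loudly instead of reaching the robot.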

Paper Structure (15 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: With the visual scene representation and LLMs, the robot can locate objects from language descriptions and interpret a language instruction into an action sequence, carrying out the task by calling the appropriate APIs.
  • Figure 2: System overview. The visual scene representation is built during the active coverage process as the robot moves to cover the environment. It merges each object's visual feature and location into the representation. Once construction is complete, the robot can use the large language model to break down a language instruction into an action sequence and locate the target object from a textual description.
  • Figure 3: Active Coverage Algorithm. (a) is the original 2D map. (b) shows the result of boundary extraction, with contours drawn in purple. (c) shows the coverage path, with the blue dot denoting the starting point and the red arrow indicating the trajectory.
  • Figure 4: Sample of executing an action sequence. For case 1, the robot is asked to "throw the coke can into the dustbin". For case 2, the robot is asked "I want to go downstairs, can you help". With pre-trained models, the robot is able to comprehend complex instructions.
  • Figure 5: Comparison of different zero-shot language-image matching methods. The results are indicated by a red cross for incorrect matches and a green check mark for correct matches. The segment-based methods are presented in red bounding boxes, while the detection-based methods are shown with a transparent background in contrast to the segmented results.
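
The coverage path in Figure 3 is described as a Hamiltonian-path formulation, meaning a route that visits every waypoint exactly once. As a rough sketch of that idea, the snippet below finds the shortest such route from a fixed start by brute force. The waypoint coordinates, the function names, and the exhaustive search are assumptions made for illustration; the summary does not say how waypoints are chosen from the extracted boundaries or which solver the authors use, and brute force is only practical for a handful of points.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def path_length(points: Sequence[Point], order: Sequence[int]) -> float:
    """Total Euclidean length of visiting points in the given index order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def shortest_hamiltonian_path(points: Sequence[Point], start: int = 0) -> Tuple[List[int], float]:
    """Try every ordering of the remaining waypoints and keep the shortest path."""
    others = [i for i in range(len(points)) if i != start]
    best = min(
        ((start,) + perm for perm in itertools.permutations(others)),
        key=lambda order: path_length(points, order),
    )
    return list(best), path_length(points, best)


# Toy waypoints in metres; index 0 is the assumed starting position.
waypoints = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
order, length = shortest_hamiltonian_path(waypoints, start=0)
print(order, length)  # [0, 2, 1, 3] 7.0
```

For larger maps the factorial cost of enumeration would have to be replaced by a heuristic or a dedicated solver.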