VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

Bangguo Yu, Yuzhen Liu, Lei Han, Hamidreza Kasaei, Tingguang Li, Ming Cao

TL;DR

VLN-Game is a zero-shot framework for visual target navigation that handles both object names and descriptive language targets. It constructs a 3D object-centric spatial map by integrating pre-trained visual-language features with a 3D reconstruction of the physical environment.

Abstract

Following human instructions to explore and search for a specified target in an unfamiliar environment is a crucial skill for mobile service robots. Most previous work on object goal navigation has focused on a single input modality as the target, which limits how well such methods handle language descriptions containing detailed attributes and spatial relationships. To address this limitation, we propose VLN-Game, a novel zero-shot framework for visual target navigation that can process both object names and descriptive language targets effectively. Specifically, our approach constructs a 3D object-centric spatial map by integrating pre-trained visual-language features with a 3D reconstruction of the physical environment. The framework then identifies the most promising areas to explore in search of potential target candidates. A game-theoretic vision-language model is employed to determine which candidate best matches the given language description. Experiments conducted on the Habitat-Matterport 3D (HM3D) dataset demonstrate that the proposed framework achieves state-of-the-art performance in both object goal navigation and language-based navigation tasks. Moreover, we show that VLN-Game can be easily deployed on real-world robots. The success of VLN-Game highlights the potential of combining game-theoretic methods with compact vision-language models to advance decision-making capabilities in robotic systems. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/vln-game.
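The abstract describes an object-centric map whose entries carry visual-language features that are compared against the target. A minimal sketch of that comparison step is given below. It assumes each mapped object stores one feature vector and that the target description has already been encoded into the same space; the tiny 3-dimensional vectors, the `object_map` layout, and the helper names `cosine_similarity` and `rank_objects` are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as uninformative
    return dot / (norm_a * norm_b)


def rank_objects(target_feature, object_map):
    """Return (object_id, similarity) pairs sorted from most to least similar."""
    scores = {
        obj_id: cosine_similarity(target_feature, obj["feature"])
        for obj_id, obj in object_map.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical object-centric map: each entry holds a 3D centroid and a
# feature vector (real visual-language features are much higher-dimensional).
object_map = {
    "chair_0": {"centroid": (1.0, 0.5, 0.0), "feature": [0.9, 0.1, 0.0]},
    "desk_1": {"centroid": (3.2, 1.1, 0.0), "feature": [0.1, 0.9, 0.2]},
    "plant_2": {"centroid": (0.4, 2.7, 0.0), "feature": [0.0, 0.1, 0.9]},
}

# Hypothetical text embedding of the target description.
target_feature = [0.2, 0.95, 0.1]

ranking = rank_objects(target_feature, object_map)
best_id, best_score = ranking[0]
print(best_id, round(best_score, 3))
```

In this toy setup the highest-ranked object would become the first candidate to inspect, and its stored centroid would give the navigation goal.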

Paper Structure

This paper contains 31 sections, 13 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Visual target navigation example. The robot uses game-theoretic vision-language models to identify the chair that best matches the instruction's description in a complex, unstructured environment.
  • Figure 2: The framework uses posed RGB-D frames to generate a 3D object-centric map and an exploration map for robot navigation. Target descriptions are parsed by an LLM to set the primary navigation goals. Using CLIP-based similarity assessments, the system evaluates the relevance between the target and environmental features to direct exploration. Upon detecting a potential target, a game-theoretic vision-language model analyzes the spatial relationships described in the target instructions. Reaching the long-term goal or identifying the target triggers a local policy that determines the robot's final actions.
  • Figure 3: The map-building process. The framework takes RGB-D images as input to generate a 3D object-centric map, in which each object is represented as a group of point clouds and CLIP features. Using the CLIP text encoder, a similarity map between the target and the objects in the map is computed. The exploration map is obtained by projecting the 3D point cloud map and is used to generate frontiers and plan paths.
  • Figure 4: An example of identifying a white desk in the scene. Two candidate targets are detected during navigation. Given the multi-view images and the language description of the target, the vision-language model outputs the ID of the candidate that best matches the description.
  • Figure 5: An example of Nash equilibrium search based on vision-language models. Once all candidate targets have been collected, the Generator and Discriminator are used to infer a coherent result.
  • ...and 4 more figures
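The Figure 5 caption mentions a Generator and a Discriminator reaching a coherent result over the candidate targets. The sketch below illustrates one way two such scorers could be reconciled: iterated best responses in a KL-regularized agreement game, where each player's distribution over candidates is pulled toward the other's while staying anchored to its own initial scores. This is an assumed simplification for illustration; the update rule, the regularization weight `lam`, the function names, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def normalize_log(log_weights):
    """Turn unnormalized log-weights into a probability distribution."""
    m = max(log_weights)
    exps = [math.exp(w - m) for w in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


def best_response(prior, other, lam):
    """Best response p_i proportional to prior_i * other_i ** (1 / lam).

    Maximizes E_p[log other] - lam * KL(p || prior).
    All probabilities must be strictly positive.
    """
    log_w = [math.log(p) + math.log(q) / lam for p, q in zip(prior, other)]
    return normalize_log(log_w)


def consensus_search(gen_prior, disc_prior, lam=2.0, iters=50, tol=1e-9):
    """Iterate simultaneous best responses until the distributions stop changing."""
    gen, disc = list(gen_prior), list(disc_prior)
    for _ in range(iters):
        new_gen = best_response(gen_prior, disc, lam)
        new_disc = best_response(disc_prior, gen, lam)
        change = max(abs(a - b) for a, b in zip(new_gen + new_disc, gen + disc))
        gen, disc = new_gen, new_disc
        if change < tol:
            break
    return gen, disc


# Hypothetical initial scores over two candidate targets (IDs 0 and 1).
# The generator slightly prefers candidate 0; the discriminator prefers 1.
gen_prior = [0.55, 0.45]
disc_prior = [0.20, 0.80]

gen, disc = consensus_search(gen_prior, disc_prior)
chosen = gen.index(max(gen))
print("generator:", [round(p, 3) for p in gen])
print("discriminator:", [round(p, 3) for p in disc])
print("chosen candidate ID:", chosen)
```

With `lam` greater than 1 the updates contract toward a fixed point at which neither distribution changes further, so the two scorers end up agreeing on one candidate even though their initial preferences differed.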