GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System

MoniJesu James; Amir Atef Habel; Aleksey Fedoseev; Dzmitry Tsetserokou

Abstract

Object-goal navigation has traditionally been limited to ground robots with closed-set object vocabularies. Existing multi-agent approaches depend on precomputed probabilistic graphs tied to fixed category sets, precluding generalization to novel goals at test time. We present GoalVLM, a cooperative multi-agent framework for zero-shot, open-vocabulary object navigation. GoalVLM integrates a Vision-Language Model (VLM) directly into the decision loop, SAM3 for text-prompted detection and segmentation, and SpaceOM for spatial reasoning, enabling agents to interpret free-form language goals and score frontiers via zero-shot semantic priors without retraining. Each agent builds a BEV semantic map from depth-projected voxel splatting, while a Goal Projector back-projects detections through calibrated depth into the map for reliable goal localization. A constraint-guided reasoning layer evaluates frontiers through a structured prompt chain (scene captioning, room-type classification, perception gating, multi-frontier ranking), injecting commonsense priors into exploration. We evaluate GoalVLM on GOAT-Bench val_unseen (360 multi-subtask episodes, 1032 sequential object-goal subtasks, HM3D scenes), where each episode requires navigating to a chain of 5-7 open-vocabulary targets. GoalVLM with N=2 agents achieves 55.8% subtask SR and 18.3% SPL, competitive with state-of-the-art methods while requiring no task-specific training. Ablation studies confirm the contributions of VLM-guided frontier reasoning and depth-projected goal localization.
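The goal-localization step described in the abstract (back-projecting a detection through calibrated depth into the BEV map) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard pinhole camera model, a camera frame with x right, y down, z forward, a planar agent pose (x, y, yaw), and a square grid centered on the world origin. The function names, the parameters `cell_size_m` and `map_size`, and the frame conventions are all hypothetical choices made for illustration.

```python
import math

import numpy as np


def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection (assumed model): pixel (u, v) with metric
    depth -> 3D point in the camera frame (x right, y down, z forward)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m], dtype=float)


def camera_point_to_bev_cell(point_cam, agent_xy, agent_yaw, cell_size_m, map_size):
    """Project a camera-frame point onto a top-down grid.

    Hypothetical conventions: world x points along the agent's heading at
    yaw = 0, world y points to the agent's left, and the grid is centered
    on the world origin. Returns (row, col), or None if outside the map.
    """
    forward, right = point_cam[2], point_cam[0]
    # Rotate the (forward, right) offset into the world frame.
    wx = agent_xy[0] + forward * math.cos(agent_yaw) + right * math.sin(agent_yaw)
    wy = agent_xy[1] + forward * math.sin(agent_yaw) - right * math.cos(agent_yaw)
    col = int(math.floor(wx / cell_size_m)) + map_size // 2
    row = int(math.floor(wy / cell_size_m)) + map_size // 2
    if 0 <= row < map_size and 0 <= col < map_size:
        return row, col
    return None


# Illustrative values only (not taken from the paper).
fx = fy = 500.0
cx, cy = 320.0, 240.0
center_point = backproject_pixel(cx, cy, 2.0, fx, fy, cx, cy)
center_cell = camera_point_to_bev_cell(center_point, (0.0, 0.0), 0.0, 0.25, 100)
```

In this sketch the height component is discarded when writing to the grid; a full pipeline of the kind the abstract describes would typically also filter points by height before marking cells.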

Paper Structure (40 sections, 14 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Multi-Agent Navigation
Vision-Language Models for Robotics
Open-Vocabulary Object Navigation
Method
Problem Formulation
Ego-Centric Semantic Mapping
Camera Intrinsic Model
Point Cloud Generation
Voxel Splatting and Height Slicing
Zero-shot Vision-Language Perception
Open-Vocabulary Detection
Spatial Reasoning with VLM
Constraint-Guided Frontier Selection
...and 25 more sections

Figures (5)

Figure 1: Real-world experiment with multiple GoalVLM agents.
Figure 2: Multiple agents process vision-language cues, perform local planning, and share semantic maps.
Figure 3: Agents' Exploration and Semantic Mapping.
Figure 4: Per-object category success rate. Transparent/reflective objects (mirror, window glass) and small objects (photo, book) are hardest. Large distinctive objects (refrigerator, piano) achieve >72% SR.
Figure 5: Subtask SR by position in the episode sequence. Performance is relatively stable across positions, suggesting limited cascading failure.
