MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation

Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang

TL;DR

The paper tackles language-grounded semantic navigation and the lack of a language-centric evaluation framework. It introduces LangNav, an open-vocabulary dataset with fine-grained linguistic annotations, and LaMoN, a sequential language-described multi-object navigation task. It then presents MLFM, a multi-layer semantic map built from vision-language features, with querying variants and a two-phase navigation strategy to perform zero-shot grounding and navigation. Experiments show that MLFM variants outperform state-of-the-art 2D mapping baselines on LangNav and GOAT-Bench, particularly in grounding fine-grained attributes and spatial relations, though texture understanding remains challenging and there is room for improving open-set generalization.

Abstract

Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate agents in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified using language. Furthermore, we propose Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
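
To make the annotation scheme concrete, the following is a minimal sketch of how one annotated goal could be represented and checked against scene objects. It is an illustration only, not the LangNav file format: the `GoalAnnotation` class, the `matches` and `positive_matches` helpers, the dictionary layout of scene objects, and the choice to store a relation as a (preposition, reference object) pair are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class GoalAnnotation:
    """Hypothetical container for one annotated goal description."""
    description: str
    category: str
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red"}
    relations: dict = field(default_factory=dict)   # e.g. {"support": ("on", "table")}


def matches(goal, obj):
    """True if the object has the goal category and every tagged attribute and relation."""
    if obj["category"] != goal.category:
        return False
    obj_attrs = obj.get("attributes", {})
    obj_rels = obj.get("relations", {})
    attrs_ok = all(obj_attrs.get(k) == v for k, v in goal.attributes.items())
    rels_ok = all(obj_rels.get(k) == v for k, v in goal.relations.items())
    return attrs_ok and rels_ok


def positive_matches(goal, objects):
    """Ids of all objects that satisfy the full description (there may be several)."""
    return [o["id"] for o in objects if matches(goal, o)]


# Goal taken from the example in the abstract; the scene objects are made up.
goal = GoalAnnotation(
    description="go to the red short candle on the table",
    category="candle",
    attributes={"color": "red", "size": "short"},
    relations={"support": ("on", "table")},
)

scene = [
    {"id": "candle_1", "category": "candle",
     "attributes": {"color": "red", "size": "short"},
     "relations": {"support": ("on", "table")}},
    {"id": "candle_2", "category": "candle",
     "attributes": {"color": "red", "size": "tall"},
     "relations": {"support": ("on", "table")}},
    {"id": "candle_3", "category": "candle",
     "attributes": {"color": "red", "size": "short"},
     "relations": {"support": ("on", "shelf")}},
]

print(positive_matches(goal, scene))
```

Keeping attributes and relations as separate fields is what allows per-tag evaluation: a failure on `candle_2` can be attributed to size, and a failure on `candle_3` to the support relation.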

Paper Structure

This paper contains 31 sections, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Language-guided Multi-Object Navigation (LaMoN) requires an agent to navigate to multiple goals, each specified by a natural language description (left). We evaluate fine-grained language understanding by tagging each description with attributes (color) and spatial relations (support) (right). There may be multiple positive matches (objects matching all attributes and relations in the instruction), and the agent is scored as correct if it stops at any of them (a check marks a match and a cross marks a non-match).
  • Figure 2: Method. (a) The agent takes an RGB image as input, from which (b) map building extracts learned visual embeddings and projects them onto map layers using the depth and camera pose inputs. (c) The map is then queried based on the input instruction using one of three techniques: vanilla, VLM, or RGraph. (d) Once the agent identifies a possible goal location, it navigates to it using a path planner. The agent activates the object detector as an additional signal during the initial phase (EAE) of the navigation.
  • Figure 3: Comparisons showing that VLM struggles to reason over projected abstract features, often interpreting them as egocentric views (example 1). RGraph struggles to distinguish 'inside' from 'above' when both objects are projected onto the same map layer (example 3).
  • Figure 4: Language descriptions in GOAT-Bench contain errors propagated from the BLIP-2 model. This figure shows examples of partial match, hallucination, mesh artifact, spatial error, and reference to object bounding box errors.
  • Figure 5: In LangNav, we use objects and attributes available in HSSD synthetic scenes, thus producing error-free language descriptions.
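
The map building and querying steps described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: it assumes points have already been back-projected into world coordinates from depth and camera pose, uses 2-D lists as stand-ins for vision-language embeddings, picks arbitrary values for `CELL_SIZE` and `LAYER_EDGES`, averages the features that fall into each cell, and answers a query by cosine similarity. That similarity lookup may or may not correspond to the paper's vanilla querying variant.

```python
import math
from collections import defaultdict

CELL_SIZE = 0.5                     # metres per grid cell (assumed value)
LAYER_EDGES = [0.0, 0.5, 1.0, 1.5]  # lower height bound of each layer in metres (assumed)


def layer_index(height):
    """Index of the highest layer whose lower bound is <= height (layer 0 otherwise)."""
    idx = 0
    for i, edge in enumerate(LAYER_EDGES):
        if height >= edge:
            idx = i
    return idx


def build_map(points):
    """Average features per (layer, row, col) cell.

    points: iterable of (x, y, z, feature) in world coordinates, with z pointing up.
    """
    sums = {}
    counts = defaultdict(int)
    for x, y, z, feat in points:
        key = (layer_index(z), math.floor(y / CELL_SIZE), math.floor(x / CELL_SIZE))
        if key not in sums:
            sums[key] = [0.0] * len(feat)
        sums[key] = [s + f for s, f in zip(sums[key], feat)]
        counts[key] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query(feature_map, text_feature):
    """Return the (layer, row, col) cell whose feature is most similar to the text feature."""
    return max(feature_map, key=lambda k: cosine(feature_map[k], text_feature))


# Toy observations: two low points in one cell, one high point above them, one low point nearby.
points = [
    (0.2, 0.3, 0.1, [1.0, 0.0]),
    (0.3, 0.2, 0.2, [0.8, 0.2]),
    (0.2, 0.3, 1.2, [0.0, 1.0]),
    (1.2, 0.3, 0.1, [0.5, 0.5]),
]
fmap = build_map(points)
best_cell = query(fmap, [0.0, 1.0])
print(best_cell)
```

In this toy example the first three points share the same ground-plane cell but land in different height layers, so their features are not averaged together as they would be in a single-layer 2D map.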