MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang
TL;DR
The paper tackles language-grounded semantic navigation and the lack of a language-centric evaluation framework. It introduces LangNav, an open-vocabulary dataset with fine-grained linguistic annotations, and LaMoN, a multi-object navigation task in which the goals are described in language and visited in sequence. It then presents MLFM, a multi-layered semantic map built from vision-language features, with querying variants and a two-phase navigation strategy for zero-shot grounding and navigation. Experiments show that MLFM variants outperform state-of-the-art 2D mapping baselines on LangNav and GOAT-Bench, particularly in grounding fine-grained attributes and spatial relations, though texture understanding remains challenging and open-set generalization leaves room for improvement.
Abstract
Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These annotations enable systematic evaluation of language understanding. To evaluate agents in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified in natural language. Furthermore, we propose Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
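
To make the ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows how a goal description with attribute and relation annotations might be represented, and how a toy layered feature map could be queried by cosine similarity against a text feature. The class names (GoalAnnotation, LayeredFeatureMap), the field names, the three-dimensional feature vectors, and the reading of "layers" as height bands are all assumptions made for illustration; a real system would take its features from a pretrained vision-language model.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class GoalAnnotation:
    """Hypothetical record for one language goal; field names are illustrative."""
    description: str
    category: str
    attributes: dict = field(default_factory=dict)
    relations: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LayeredFeatureMap:
    """Toy map: one feature vector per (row, col, layer) cell.

    Assumption: a layer index stands for a height band. The abstract does not
    specify how layers are defined.
    """

    def __init__(self):
        self.cells = {}  # (row, col, layer) -> feature vector

    def update(self, row, col, layer, feature):
        """Store (overwrite) the feature observed at a cell."""
        self.cells[(row, col, layer)] = list(feature)

    def query(self, text_feature):
        """Return the cell most similar to the text feature and its score."""
        best_key, best_score = None, -1.0
        for key, feature in self.cells.items():
            score = cosine(feature, text_feature)
            if score > best_score:
                best_key, best_score = key, score
        return best_key, best_score


# Example goal taken from the abstract; the category value is an assumption.
goal = GoalAnnotation(
    description="go to the red short candle on the table",
    category="candle",
    attributes={"color": "red", "size": "short"},
    relations={"support": "on"},
)

# Made-up features standing in for vision-language embeddings.
fmap = LayeredFeatureMap()
fmap.update(2, 3, 0, [0.9, 0.1, 0.0])
fmap.update(2, 3, 1, [0.1, 0.9, 0.2])
fmap.update(5, 1, 1, [0.0, 0.2, 0.9])

# Made-up text feature standing in for the encoded goal description.
best_cell, score = fmap.query([0.0, 1.0, 0.1])
```

In this toy setup the query selects the cell whose stored feature points in nearly the same direction as the text feature. Keeping separate layers per grid location lets two observations at the same (row, col) coexist instead of overwriting each other, which is one plausible motivation for a multi-layered map over a single-layer 2D map.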
