MUVLA: Learning to Explore Object Navigation via Map Understanding

Peilong Han, Fan Jia, Min Zhang, Yutao Qiu, Hongyao Tang, Yan Zheng, Tiancai Wang, Jianye Hao

TL;DR

MUVLA addresses object navigation by unifying historical exploration through semantic map abstractions and learning action values via reward-guided supervision. The approach uses a three-stage training pipeline (map understanding, behavior cloning, and reward amplification) together with cross-modal fusion of semantic maps and dense observation histories. Experiments on HM3D and Gibson show state-of-the-art performance among training-based methods and strong zero-shot generalization, demonstrating robust, efficient exploration in unseen environments. The framework highlights the practical impact of integrating spatial memory, language grounding, and value-driven learning for generalizable embodied navigation. Value estimates that guide exploration are built on a short-horizon return-to-go, $RTG_t = \sum_{k=0}^{K-1} \gamma^k R_{t+k}$, together with expectile regression.
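To make the return-to-go formula in the TL;DR concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `short_horizon_rtg` and `expectile_loss` are hypothetical, the sum is truncated when the window runs past the end of the trajectory (the summary does not say how the paper handles this), and the expectile loss uses the standard asymmetric squared-error form, which may differ in detail from the paper's.

```python
from typing import List


def short_horizon_rtg(rewards: List[float], gamma: float, horizon: int) -> List[float]:
    """Compute RTG_t = sum_{k=0}^{K-1} gamma^k * R_{t+k} for every step t.

    Assumption: the window is truncated at the end of the trajectory,
    so late steps sum over fewer than `horizon` rewards.
    """
    n = len(rewards)
    rtg = []
    for t in range(n):
        total = 0.0
        for k in range(horizon):
            if t + k >= n:
                break  # window ran past the last recorded reward
            total += (gamma ** k) * rewards[t + k]
        rtg.append(total)
    return rtg


def expectile_loss(pred: float, target: float, tau: float) -> float:
    """Standard expectile (asymmetric squared) loss for one sample.

    With u = target - pred, the weight is tau when u >= 0 and (1 - tau)
    otherwise. For tau > 0.5, under-estimates are penalised more than
    over-estimates, pushing predictions toward the upper range of targets.
    """
    u = target - pred
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


# Hypothetical per-step progress rewards, for illustration only.
example_rewards = [1.0, 0.0, 2.0]
example_rtg = short_horizon_rtg(example_rewards, gamma=0.5, horizon=2)
```

With these example values, `example_rtg` is `[1.0, 1.0, 2.0]`: the first entry is 1.0 + 0.5 × 0.0, the second is 0.0 + 0.5 × 2.0, and the last is truncated to the single remaining reward.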

Abstract

In this paper, we present MUVLA, a Map Understanding Vision-Language-Action model tailored for object navigation. It leverages semantic map abstractions to unify and structure historical information, encoding spatial context in a compact and consistent form. MUVLA takes the current and historical observations, together with the semantic map, as inputs and predicts an action sequence conditioned on a description of the goal object. Furthermore, it amplifies supervision through reward-guided return modeling based on dense short-horizon progress signals, enabling the model to develop a detailed understanding of action value for reward maximization. MUVLA employs a three-stage training pipeline: learning map-level spatial understanding, imitating behaviors from mixed-quality demonstrations, and reward amplification. This strategy allows MUVLA to unify diverse demonstrations into a robust spatial representation and generate more rational exploration strategies. Experiments on the HM3D and Gibson benchmarks demonstrate that MUVLA generalizes well and learns effective exploration behaviors even from low-quality or partially successful trajectories.

Paper Structure

This paper contains 24 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: MUVLA focuses on learning efficient exploration strategies by (1) leveraging map-based abstraction to unify diverse and noisy historical trajectories into a robust decision foundation, and (2) learning to evaluate the quality of candidate actions, enabling direct prediction of high-quality actions for efficient navigation.
  • Figure 2: Overview of the MUVLA framework, which integrates semantic maps, observations, and language for efficient object navigation. The three-stage training targets map understanding, behavior cloning, and reward amplification.
  • Figure 3: Data collection workflow and formats for MUVLA’s three-stage training (left), data volume proportion for each stage (top right), and action distribution in the dataset (bottom right; note that the “stop” action has been augmented).