What Matters in RL-Based Methods for Object-Goal Navigation? An Empirical Study and A Unified Framework

Hongze Wang, Boyang Sun, Jiaxu Xing, Fan Yang, Marco Hutter, Dhruv Shah, Davide Scaramuzza, Marc Pollefeys

TL;DR

This paper addresses Object-Goal Navigation by systematically dissecting modular RL pipelines into perception, policy, and test-time enhancement, and comparing their contributions under controlled experiments. It finds perception quality and test-time strategies to be the primary drivers of performance, with policy improvements offering limited gains given current training methods. The authors propose concrete design guidelines and demonstrate a modular system that sets a new state of the art (SotA) on standard benchmarks, while revealing a substantial gap to human experts, who reach a 98% success rate (SR). This work emphasizes principled evaluation and practical deployment considerations, including dynamic evaluation and plug-in enhancements, to accelerate progress toward robust, real-world ObjectNav systems.

Abstract

Object-Goal Navigation (ObjectNav) is a critical step toward deploying mobile robots in everyday, uncontrolled environments such as homes, schools, and workplaces. In this setting, a robot must locate target objects in previously unseen environments using only its onboard perception. Success requires integrating semantic understanding, spatial reasoning, and long-horizon planning, a combination that remains extremely challenging. While reinforcement learning (RL) has become the dominant paradigm, progress has been spread across a wide range of design choices, and the field still lacks a unifying analysis of which components truly drive performance. In this work, we conduct a large-scale empirical study of modular RL-based ObjectNav systems, decomposing them into three key components: perception, policy, and test-time enhancement. Through extensive controlled experiments, we isolate the contribution of each and uncover clear trends: perception quality and test-time strategies are decisive drivers of performance, whereas policy improvements with current methods yield only marginal gains. Building on these insights, we propose practical design guidelines and demonstrate an enhanced modular system that surpasses State-of-the-Art (SotA) methods by 6.6% in SPL and by 2.7% in success rate. We also introduce a human baseline under identical conditions, where experts achieve an average success rate of 98%, underscoring the gap between RL agents and human-level navigation. Our study not only sets a new SotA but also provides principled guidance for future ObjectNav development and evaluation.
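
The abstract reports gains in SPL and success rate without defining them. As a rough sketch, assuming the paper uses the standard definitions from the embodied-navigation literature (success rate as the fraction of successful episodes, and SPL as success weighted by the ratio of shortest-path length to the path actually taken), the two metrics could be computed as follows. The function names and the example numbers are illustrative and do not come from the paper.

```python
from typing import Sequence


def success_rate(successes: Sequence[int]) -> float:
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[int],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Per episode: S_i * l_i / max(p_i, l_i), where l_i is the shortest-path
    length to the goal and p_i is the length of the path the agent took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        denom = max(p, l)
        if denom > 0:  # guard against degenerate zero-length episodes
            total += s * l / denom
    return total / len(successes)


# Made-up example: three episodes, two successful.
example_successes = [1, 0, 1]
example_shortest = [5.0, 4.0, 10.0]
example_paths = [10.0, 8.0, 10.0]
example_spl = spl(example_successes, example_shortest, example_paths)
example_sr = success_rate(example_successes)
```

Under this definition an agent is penalised for detours even when it succeeds: the first episode above succeeds but takes twice the shortest path, so it contributes only 0.5 to the average.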

Paper Structure

This paper contains 28 sections, 2 equations, 9 figures, 4 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Overview of our work. Our framework encompasses: (1) an empirical study analyzing the impact of different modules, and (2) a unified framework with interchangeable components, enabling users to customize their own object-goal navigation policies.
  • Figure 2: Unified Framework of our experimental setting. A. Perception: RGB-D and pose are fused into a top-down semantic map. B. Policy: The map and auxiliary inputs (e.g., category, orientation) guide action prediction. C. Test-Time Enhancement: Plug-and-play strategies applied at evaluation to boost performance without retraining.
  • Figure 3: Analysis of Test Navigation Scenarios. Sankey plots illustrate the distribution of success and failure cases over 1,000 test episodes across five indoor scenes.
  • Figure 4: Perception Module Overview. RGB images are processed by a pretrained object detector for semantic labels, which are projected with depth-based point clouds to form a voxel map. Summing across height levels yields a multi-layer top-down semantic map, where $K$ is the channel number and $M$ the map size.
  • Figure 5: Action Space. The current goal position is represented by a blue filled dot or line. For both continuous and discrete action spaces, red unfilled dots indicate the possible next goal positions. As shown in the map, in the continuous action space, the next goal can be located anywhere on the map. In contrast, for the discrete action space, the next goal is selected only from a predefined list of candidate positions. The yellow dashed line illustrates a potential navigation trajectory generated by the local policy.
  • ...and 4 more figures
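
The Figure 4 caption describes semantically labelled points being binned into a voxel map and then summed across height levels to produce a $K$-channel top-down map of size $M \times M$. The sketch below illustrates that idea in plain Python. It is not the authors' implementation: it assumes the points have already been back-projected from depth into world coordinates and labelled by a detector, and the function name, cell size, and height binning are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# (x, y, z, class_id): position in metres (world frame) plus a semantic label.
Point = Tuple[float, float, float, int]


def build_semantic_map(
    points: List[Point],
    num_classes: int,          # K: number of semantic channels
    map_size: int,             # M: the map is M x M cells
    cell_size: float = 0.05,   # assumed cell edge length in metres
    num_height_bins: int = 4,  # assumed number of vertical voxel layers
    max_height: float = 2.0,   # assumed height range covered, in metres
) -> List[List[List[int]]]:
    """Return a K x M x M top-down map of per-class point counts."""
    # Voxel grid indexed as voxels[class][height_bin][row][col].
    voxels = [
        [[[0] * map_size for _ in range(map_size)] for _ in range(num_height_bins)]
        for _ in range(num_classes)
    ]
    bin_height = max_height / num_height_bins
    for x, y, z, k in points:
        col = int(x // cell_size)
        row = int(y // cell_size)
        h = int(z // bin_height)
        inside = (
            0 <= row < map_size
            and 0 <= col < map_size
            and 0 <= h < num_height_bins
            and 0 <= k < num_classes
        )
        if inside:  # points outside the grid are dropped
            voxels[k][h][row][col] += 1
    # Collapse the height axis by summation to get the top-down map.
    return [
        [
            [
                sum(voxels[k][h][r][c] for h in range(num_height_bins))
                for c in range(map_size)
            ]
            for r in range(map_size)
        ]
        for k in range(num_classes)
    ]


# Made-up example: two class-1 points stacked vertically over the same cell,
# one class-0 point elsewhere, and one point that falls outside the map.
example_points = [
    (0.12, 0.03, 0.3, 1),
    (0.12, 0.03, 1.7, 1),
    (0.35, 0.35, 0.6, 0),
    (5.0, 5.0, 0.5, 0),
]
example_map = build_semantic_map(
    example_points, num_classes=2, map_size=4, cell_size=0.1
)
```

In the example, the two class-1 points sit in different height bins of the voxel grid but land in the same top-down cell after summation, which is the collapse the caption describes.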