MOPA: Modular Object Navigation with PointGoal Agents

Sonia Raychaudhuri, Tommaso Campari, Unnat Jain, Manolis Savva, Angel X. Chang

TL;DR

MOPA is a modular Object Navigation framework that decouples object detection, semantic mapping, exploration, and navigation, which lets it reuse pretrained PointNav policies for long-horizon tasks. By building a top-down semantic map and testing multiple exploration strategies, the authors show that a simple Uniform exploration policy combined with a PointNav-based navigator can outperform more complex end-to-end or analytically planned baselines. The accompanying MultiON 2.0 dataset provides a large, challenging benchmark with natural and cylinder objects, distractors, and longer episodes for studying generalization and transfer. The results demonstrate strong transferability to unseen environments and highlight practical design choices, most notably the benefit of modularity and the surprising efficacy of Uniform exploration. Overall, the work suggests that transfer learning and simple heuristics within a modular pipeline can yield robust, scalable solutions for long-horizon embodied navigation tasks.

Abstract

We propose a simple but effective modular approach MOPA (Modular ObjectNav with PointGoal agents) to systematically investigate the inherent modularity of the object navigation task in Embodied AI. MOPA consists of four modules: (a) an object detection module trained to identify objects from RGB images, (b) a map building module to build a semantic map of the observed objects, (c) an exploration module enabling the agent to explore the environment, and (d) a navigation module to move to identified target objects. We show that we can effectively reuse a pretrained PointGoal agent as the navigation model instead of learning to navigate from scratch, thus saving time and compute. We also compare various exploration strategies for MOPA and find that a simple uniform strategy significantly outperforms more advanced exploration methods.
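
The interplay between the map, exploration, and navigation modules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary-based map, the grid resolution, and the heading convention are all hypothetical, and "uniform" is read here as sampling an exploration goal uniformly at random over map cells, which is an assumption. The sketch picks the target's map location when the target has already been observed, falls back to a uniformly sampled exploratory goal otherwise, and converts the chosen cell into the relative (distance, angle) goal that a PointGoal-style policy typically consumes.

```python
import math
import random
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col) index on a top-down grid map (hypothetical layout)


def sample_uniform_goal(map_size: Tuple[int, int], rng: random.Random) -> Cell:
    """Sample an exploration goal uniformly over all map cells (assumed reading of 'uniform')."""
    rows, cols = map_size
    return (rng.randrange(rows), rng.randrange(cols))


def select_goal(
    semantic_map: Dict[str, Cell],
    target: str,
    map_size: Tuple[int, int],
    rng: random.Random,
) -> Tuple[Cell, str]:
    """Return the target's cell if it is already on the map, else an exploratory goal."""
    if target in semantic_map:
        return semantic_map[target], "navigate"
    return sample_uniform_goal(map_size, rng), "explore"


def relative_goal(
    agent_cell: Cell, agent_heading: float, goal_cell: Cell, cell_size_m: float
) -> Tuple[float, float]:
    """Convert a goal cell into (distance in meters, angle in radians) relative to the agent.

    Assumed convention: heading 0 points along +row, and angles grow toward +col.
    """
    d_row = goal_cell[0] - agent_cell[0]
    d_col = goal_cell[1] - agent_cell[1]
    distance = math.hypot(d_row, d_col) * cell_size_m
    angle = math.atan2(d_col, d_row) - agent_heading
    # Wrap the angle into [-pi, pi) so the policy sees a bounded input.
    angle = (angle + math.pi) % (2 * math.pi) - math.pi
    return distance, angle


if __name__ == "__main__":
    rng = random.Random(0)
    observed = {"chair": (12, 30)}  # objects memorized while exploring
    goal, mode = select_goal(observed, "chair", (64, 64), rng)
    print(mode, goal, relative_goal((10, 10), 0.0, goal, 0.5))
    goal, mode = select_goal(observed, "bed", (64, 64), rng)
    print(mode, goal)
```

In this reading, the pretrained PointGoal policy never needs to know whether its goal came from the semantic map or from exploration, which is what makes reusing it without retraining plausible.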

Paper Structure (26 sections, 13 figures, 13 tables)

This paper contains 26 sections, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Approach Overview. We tackle long-horizon navigation tasks by leveraging their inherent modularity. The agent uses an exploration module to seek the goal in the environment. Once the goal is observed, a navigation module moves the agent towards the goal. While exploring, the agent memorizes objects it sees along the way so it can more efficiently navigate to them later.
  • Figure 2: Architecture. We adopt a modular approach with PointNav agents (MOPA) to tackle object navigation tasks. The Object detection module transforms raw RGB to semantic labels. These are projected onto a top-down semantic map using depth observations by the Map building module. The map is passed as input for the Exploration module to uncover unseen areas of the environment. A Planning module then selects a relative goal (from either the task goal if on map or an exploratory goal). Finally, a low-level Navigation policy predicts the action for the agent to execute.
  • Figure 3: MultiON performance analysis. Error modes include the agent exhausting its step limit or stopping at a location far from the goal. In the cases where the agent ran out of steps, it either had not yet discovered the goal or had discovered the goal but failed to stop near it.
  • Figure 4: ObjectNav performance analysis. Examples of successful (64%) and failed episodes (36%) with OracleSem. Some episodes fail even when the agent is within 1m of the goal bounding box with the goal in sight (top middle), indicating that the viewpoints sampled for determining success in ObjectNav are sparse.
  • Figure 5: Comparing path lengths across tasks. (a) shows that 3ON2.0 has longer episodes than both Habitat ObjectNav 2021 [batra2020objectnav] and the original 3ON [wani2020multion] (26m vs. 23m), with 5ON2.0 having the longest average episode length. (b) shows that the average distance between the object-goal pairs is greater in 3ON2.0 than 3ON. With more object-goals, 5ON2.0 has more closely-spaced objects. These plots show that MultiON 2.0 contains harder episodes than Habitat ObjectNav 2021 and 3ON, with longer average shortest path and with object-goals placed farther apart.
  • ...and 8 more figures
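
The map-building step described in the Figure 2 caption (semantic labels projected onto a top-down map using depth observations) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes a pinhole camera with a known horizontal field of view, a known 2D agent pose, label 0 meaning "background", and it ignores point height entirely. The function name and all parameters are hypothetical.

```python
import math
from typing import List, Tuple


def project_labels(
    depth: List[List[float]],   # per-pixel forward depth in meters
    labels: List[List[int]],    # per-pixel semantic label; 0 is assumed to mean background
    hfov_deg: float,            # horizontal field of view of the camera
    agent_xy: Tuple[float, float],  # agent position in world meters
    agent_yaw: float,           # agent heading in radians; 0 points along +x
    cell_size: float,           # map resolution in meters per cell
    grid: List[List[int]],      # top-down map, indexed as grid[row][col]
) -> List[List[int]]:
    """Write each labeled pixel into the top-down grid cell that its depth point falls in."""
    height, width = len(depth), len(depth[0])
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)  # focal length in pixels
    cx = (width - 1) / 2                                     # principal point (image center)
    cos_y, sin_y = math.cos(agent_yaw), math.sin(agent_yaw)
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            label = labels[v][u]
            if z <= 0 or label == 0:
                continue  # skip invalid depth and background pixels
            x_cam = (u - cx) * z / fx  # lateral offset, positive to the agent's right
            # Rotate (forward=z, right=x_cam) into the world frame and translate.
            wx = agent_xy[0] + z * cos_y + x_cam * sin_y
            wy = agent_xy[1] + z * sin_y - x_cam * cos_y
            row = math.floor(wy / cell_size)
            col = math.floor(wx / cell_size)
            if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
                grid[row][col] = label
    return grid


if __name__ == "__main__":
    grid = [[0] * 8 for _ in range(8)]
    depth = [[2.0, 2.0, 2.0]]
    labels = [[0, 5, 0]]  # only the center pixel shows an object (label 5)
    project_labels(depth, labels, 90.0, (1.1, 1.1), 0.0, 0.5, grid)
    for row in grid:
        print(row)
```

A fuller version would also use the vertical pixel coordinate to recover point height and discard floor or ceiling points before writing to the map; that filtering is omitted here to keep the projection itself easy to follow.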