FOM-Nav: Frontier-Object Maps for Object Goal Navigation

Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid

TL;DR

FOM-Nav introduces Frontier-Object Maps to maintain online, semantically rich memory for object-goal navigation. It uses a vision-language model to predict high-level navigation goals from encoded frontiers, objects, and path history, paired with a robust low-level planner (FMM + A*) for trajectory execution. The approach is trained on automatically constructed navigation datasets from real-world scans and achieves state-of-the-art results on MP3D and HM3D, with strong exploration efficiency as reflected in SPL, plus successful real-world deployment. Overall, the work offers a scalable, modular pipeline that improves long-horizon ObjectNav by uniting online hybrid maps with powerful multimodal reasoning.

Abstract

This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly on the navigation efficiency metric SPL, and yields promising results on a real robot.

Paper Structure

This paper contains 22 sections, 5 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: The proposed frontier-object map is a rich representation of objects and frontiers (boundaries of the explored scene), displayed here as colored point clouds and red lines. It encodes geometric, distance and visual/textual information for frontiers and objects.
  • Figure 2: FOM-Nav framework. At each step, the agent receives a posed RGB-D image. First, we back-project depth into a point cloud to maintain a 3D obstacle map, from which we derive 2D obstacle and exploration maps and frontiers. Simultaneously, we segment objects, and extract and store their geometric, visual and textual features in a 3D object map. Then, a VLM processes frontiers, objects, and path history to predict a high-level navigation goal. Finally, a low-level path planner (A$^*$ and fast marching algorithm) plans a path to the goal.
  • Figure 3: Our high-level goal prediction module employs a transformer-based LLM to process a rich semantic representation of the scene, including objects, frontiers, path history and texts. Objects and frontiers consist of visual, geometric, distance, and textual features. The prediction head operates in two phases: 1) the transformer's last output token $E_N$ is classified as either a frontier or object destination type; 2) $E_N$ is matched against output tokens for objects or frontiers using a learnable matrix $A$ to identify the best navigation goal.
  • Figure 4: Visualization of a navigation episode to a bed in an HM3D environment (Ramakrishnan et al., 2021). Left: RGB view from the agent. Center: Online frontier–object maps: grey = inflated obstacles, green = explored area, purple dots = objects, red circles = frontiers, and filled orange/cyan circles = selected object/frontier goals. If the chosen object lies in unexplored space, the agent first moves toward the corresponding frontier. Planned and executed paths are shown in pink and navy. Right: Agent trajectory (navy), ground-truth object (white box), and stopping area (pink) on the GT map.
  • Figure S1: Visualizations of RGB images (top left of each block) and the masks used to compute visual features for each frontier. The masks are the highlighted parts of the darker images, localized around each frontier. Some frontiers are located on image borders and their mask is computed as a vertical stripe, see the last image in the top blocks.
  • ...and 5 more figures
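
The two-phase prediction head described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `predict_goal`, the linear type classifier `w_type`, and the bilinear score $E_N^\top A E_i$ are assumptions made for illustration, since the caption only states that $E_N$ is first classified as a frontier or object destination and then matched against candidate tokens using a learnable matrix $A$. Plain Python lists stand in for tensors.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> List[float]:
    """Matrix-vector product, one dot product per row."""
    return [dot(row, v) for row in m]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def predict_goal(
    e_n: Vector,
    w_type: Matrix,
    a: Matrix,
    object_tokens: Sequence[Vector],
    frontier_tokens: Sequence[Vector],
) -> Tuple[str, int]:
    """Hypothetical two-phase head: pick a goal type, then a goal index.

    w_type is an assumed 2 x d classifier (row 0 = frontier, row 1 = object);
    a is the d x d matching matrix, used here in an assumed bilinear form.
    """
    # Phase 1: classify the last output token as frontier or object.
    type_logits = matvec(w_type, e_n)
    goal_type = ("frontier", "object")[argmax(type_logits)]

    # Phase 2: score each candidate token of that type as e_n^T A e_i.
    candidates = object_tokens if goal_type == "object" else frontier_tokens
    if not candidates:
        raise ValueError(f"no {goal_type} candidates to match against")
    scores = [dot(e_n, matvec(a, e_i)) for e_i in candidates]
    return goal_type, argmax(scores)


# Toy 2-dimensional example with hand-picked values.
e_n = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
objects = [[0.1, 0.9], [0.8, 0.2]]
frontiers = [[0.5, 0.5]]
prefer_object = [[0.2, 0.0], [0.9, 0.0]]
goal = predict_goal(e_n, prefer_object, identity, objects, frontiers)
```

In this toy setting the type logits favour the object branch, and the second object token has the larger matching score, so `goal` selects it. In a trained model the classifier and $A$ would be learned jointly with the transformer rather than fixed by hand.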