FOM-Nav: Frontier-Object Maps for Object Goal Navigation
Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid
TL;DR
FOM-Nav introduces Frontier-Object Maps, an online memory that jointly encodes spatial frontiers and fine-grained object information for object-goal navigation. A vision-language model predicts high-level navigation goals from the encoded frontiers, objects, and path history, and a low-level planner (FMM + A*) turns each goal into a trajectory. The model is trained on navigation datasets constructed automatically from real-world scans. It achieves state-of-the-art results on MP3D and HM3D, particularly on the navigation efficiency metric SPL, and shows promising results on a real robot. Overall, the work offers a modular pipeline for ObjectNav that combines online hybrid maps with multimodal reasoning.
Abstract
This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and predicts a high-level goal, which a low-level planner then executes by generating an efficient trajectory. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and of the constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly on the navigation efficiency metric SPL (Success weighted by Path Length), and yields promising results on a real robot.
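Since the abstract singles out SPL as the efficiency metric, a minimal sketch of how it is commonly computed may help. It follows the standard definition, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i indicates success, l_i is the shortest-path length to the goal, and p_i is the length of the path the agent actually took. The function name `spl` and its argument names are illustrative assumptions, not code from FOM-Nav or from any benchmark toolkit.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Hypothetical helper for illustration; names are assumptions.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute 0; successful ones contribute the
        # ratio of the shortest path to the path actually travelled.
        if success and denom > 0:
            total += shortest / denom
    return total / n


# Three episodes: an optimal success, a success with a 2x detour, a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(score)
```

An agent that always succeeds but wanders scores well below 1, which is why SPL is read as a measure of exploration efficiency and not only of success.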
