Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation
Leyuan Sun, Asako Kanezaki, Guillaume Caron, Yusuke Yoshiyasu
TL;DR
This work tackles Object-Goal Navigation (ObjectNav) in unseen environments by injecting object-to-room (O2R) commonsense knowledge derived from a Large Language Model (LLM) into a modular, multimodal navigation framework named LROGNav. The method fuses RGB-D, pose, CLIP-based room cues, and LLM-predicted room priors through a multi-channel Swin-Unet to regress three frontier maps and a long-term goal, guided by an uncertainty-weighted multi-task loss. Empirical results in Habitat (Gibson and Matterport3D) show an average SPL improvement of 10.6% over the baseline, with competitive success rate (SR) and robust sim2real transfer demonstrated on a Kobuki robot; ablations validate the contributions of O2R knowledge, multimodal fusion, and the three-frontier design. The work advances practical ObjectNav by combining data-driven perception with language-model knowledge to localize targets efficiently in novel environments, bridging perception, semantics, and planning for real-world robotics applications.
Abstract
Object-goal navigation is a crucial engineering task for the embodied navigation community; it involves navigating to an instance of a specified object category within unseen environments. Although both end-to-end and modular data-driven approaches have been investigated extensively, enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular approach trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We use a multi-channel Swin-Unet architecture to perform multi-task learning with multimodal inputs. The results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% in the efficiency metric, Success weighted by Path Length (SPL). The real-world demonstration shows that the proposed approach can efficiently conduct this task by traversing several rooms. For more details and real-world demonstrations, please check our project webpage (https://sunleyuan.github.io/ObjectNav).
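
For readers unfamiliar with the efficiency metric, the following is a minimal sketch of SPL under its commonly used definition: the per-episode success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, averaged over all episodes. The function name `spl` and its argument names are illustrative assumptions, not taken from the paper's code, and the exact evaluation script used by the authors may differ.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    successes[i]        -- whether episode i reached the goal
    shortest_lengths[i] -- shortest-path length from start to goal
    path_lengths[i]     -- length of the path the agent actually took
    (All names here are hypothetical, chosen for illustration.)
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute 0; the denom check guards against
        # degenerate zero-length episodes.
        if success and denom > 0:
            total += shortest / denom
    return total / n


if __name__ == "__main__":
    # Three episodes: an optimal success, a success with a path twice
    # the shortest length, and a failure.
    print(spl([True, True, False], [5.0, 5.0, 5.0], [5.0, 10.0, 7.0]))
```

Under this definition a successful episode scores 1 only when the agent follows a shortest path, so SPL penalizes detours that the plain success rate ignores.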
