Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation

Leyuan Sun, Asako Kanezaki, Guillaume Caron, Yusuke Yoshiyasu

TL;DR

This work tackles Object-Goal Navigation in unseen environments by injecting Large Language Model–derived object-to-room (O2R) commonsense into a modular, multimodal navigation framework named LROGNav. The method fuses RGB-D, pose, CLIP-based room cues, and LLM-predicted room priors through a multi-channel Swin-Unet to regress three frontier maps and a long-term goal, guided by an uncertainty-weighted multi-task loss. Empirical results in Habitat (Gibson and Matterport3D) show an average $SPL$ improvement of 10.6% over the baseline with a competitive success rate ($SR$), and sim2real transfer is demonstrated on a Kobuki robot; ablations validate the contributions of O2R knowledge, multimodal fusion, and the three-frontier design. The work advances practical ObjectNav by combining data-driven perception with language-model knowledge to localize targets more efficiently in novel environments, bridging perception, semantics, and planning for real-world robotics applications.

Abstract

Object-goal navigation is a crucial engineering task for the embodied navigation community; it involves navigating to an instance of a specified object category within unseen environments. Although both end-to-end and modular, data-driven approaches have been investigated extensively, fully enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular approach, trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We use a multi-channel Swin-Unet architecture to conduct multi-task learning with multimodal inputs. The results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% in the efficiency metric, Success weighted by Path Length (SPL). The real-world demonstration shows that the proposed approach can efficiently conduct this task by traversing several rooms. For more details and real-world demonstrations, please check our project webpage (https://sunleyuan.github.io/ObjectNav).
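
Since the headline result is reported in SPL, a minimal sketch of how this metric is commonly computed may help. It follows the standard definition (a successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path; a failed episode contributes zero). The `Episode` type and its field names are assumptions introduced for illustration, not part of the paper or of Habitat's API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container, for illustration only)."""
    success: bool           # did the agent stop close enough to the target?
    shortest_path_m: float  # geodesic shortest-path length from start to goal
    agent_path_m: float     # length of the path the agent actually took


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("SPL is undefined for an empty episode list")
    total = 0.0
    for ep in episodes:
        # Failed episodes contribute 0 regardless of path length.
        if not ep.success:
            continue
        denom = max(ep.agent_path_m, ep.shortest_path_m)
        # Guard against degenerate zero-length episodes.
        if denom > 0:
            total += ep.shortest_path_m / denom
    return total / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(success=True, shortest_path_m=5.0, agent_path_m=10.0),
        Episode(success=True, shortest_path_m=4.0, agent_path_m=4.0),
        Episode(success=False, shortest_path_m=3.0, agent_path_m=9.0),
    ]
    print(f"SPL = {spl(demo):.3f}")
```

Because the denominator is never smaller than the shortest path, each episode's contribution lies in [0, 1], so SPL rewards both reaching the target and doing so along a near-optimal route.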

Paper Structure (25 sections, 15 equations, 32 figures, 8 tables, 3 algorithms)

Figures (32)

  • Figure 1: This study proposes utilizing LLM-based knowledge of object-to-room relationships to improve the efficiency of the object-goal navigation task. Positive and negative prompts are combined to determine the likelihood of the target object's presence in various room categories.
  • Figure 2: This overview illustrates the proposed approach, LROGNav. RGB-D and pose data are encoded for a semantic projection mapping module. The direction and distance to the nearest target object are captured and projected using a linear embedding, along with three word embeddings: the target object, CLIP-based room category estimations, and LLM-based potential rooms. A multi-channel Swin-Unet is employed to integrate these modalities. The primary task, handled by one of the decoders, is to predict frontiers close to the target object. One auxiliary task focuses on predicting frontiers that require further exploration, while another auxiliary task assigns high scores to frontiers located in rooms with a high likelihood of containing the target object, as pre-determined by LLM-based knowledge. These three tasks are combined to determine the long-term goal, followed by an analytical method to gradually approach the goal until the target is detected.
  • Figure 3: The Object-2-Room relationship matrix utilizing LLM-based knowledge. In the Gibson dataset, the room categories represented on the y-axis are: "bathroom", "bedroom", "child's room", "closet", "corridor", "dining room", "empty room", "exercise room", "garage", "home office", "kitchen", "living room", "lobby", "pantry room", "playroom", "staircase", "storage room", "television room", "utility room". The object categories on the x-axis include: "chair", "couch", "potted plant", "bed", "toilet", "tv", "dining table", "oven", "sink", "refrigerator", "book", "clock", "vase", "cup", "bottle".
  • Figure 4: The box plot compares the scores of "Positive only" prompts ($LLM_{pos}(r,o)$) with those of "Positive w/ negative" prompts ($LLM_{pos}(r,o) - LLM_{neg}(r,o)$).
  • Figure 5: An example from the Gibson dataset [xia2018gibson]: the Beechwood house. From left to right, it shows the mesh with texture, the mesh with room annotations, and the room segmentation for each floor.
  • ...and 27 more figures
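
The captions of Figures 1, 3, and 4 describe combining positive and negative prompts into an object-to-room score, $LLM_{pos}(r,o) - LLM_{neg}(r,o)$, arranged as a room-by-object matrix. The sketch below illustrates only that arithmetic. The numeric scores are hypothetical placeholders rather than values from the paper, the helper names `o2r_matrix` and `rank_rooms` are introduced here for illustration, and treating a missing pair as 0.0 is an assumption; the room and object names are a small subset of the Gibson categories listed in Figure 3.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (room, object)

ROOMS = ["bathroom", "bedroom", "kitchen", "living room"]
OBJECTS = ["toilet", "bed", "oven"]

# Hypothetical placeholder scores, NOT taken from the paper.
# In the paper these would come from positive and negative LLM prompts.
LLM_POS: Dict[Pair, float] = {
    ("bathroom", "toilet"): 0.9,
    ("bedroom", "bed"): 0.8,
    ("kitchen", "oven"): 0.9,
    ("living room", "bed"): 0.3,
}
LLM_NEG: Dict[Pair, float] = {
    ("bathroom", "bed"): 0.7,
    ("kitchen", "toilet"): 0.6,
    ("living room", "bed"): 0.4,
    ("bedroom", "oven"): 0.8,
}


def o2r_matrix(rooms: Sequence[str], objects: Sequence[str],
               pos: Dict[Pair, float], neg: Dict[Pair, float]) -> List[List[float]]:
    """Rows are rooms, columns are objects; entry = pos(r, o) - neg(r, o).

    Pairs missing from a dictionary are treated as 0.0 (an assumption).
    """
    return [[pos.get((r, o), 0.0) - neg.get((r, o), 0.0) for o in objects]
            for r in rooms]


def rank_rooms(target: str, rooms: Sequence[str], objects: Sequence[str],
               matrix: List[List[float]]) -> List[str]:
    """Return rooms ordered from most to least likely to contain `target`."""
    col = objects.index(target)
    order = sorted(range(len(rooms)), key=lambda i: matrix[i][col], reverse=True)
    return [rooms[i] for i in order]


if __name__ == "__main__":
    m = o2r_matrix(ROOMS, OBJECTS, LLM_POS, LLM_NEG)
    for obj in OBJECTS:
        print(obj, "->", rank_rooms(obj, ROOMS, OBJECTS, m))
```

Subtracting the negative-prompt score pushes implausible rooms below neutral ones instead of leaving them tied at zero, which is one way to read the contrast between "Positive only" and "Positive w/ negative" in Figure 4.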