
Commonsense Scene Graph-based Target Localization for Object Search

Wenqi Ge, Chao Tang, Hong Zhang

TL;DR

This work tackles efficient target localization for object search in household robotics by introducing a Commonsense Scene Graph (CSG) that fuses room-level layout information from pre-built maps with object-level commonsense obtained from a large language model. Target localization is cast as a link-prediction task in the CSG via the CSG-TL module, which informs an overall CSG-OS object-search pipeline that generates candidate navigation points and updates the graph with new observations. Empirical results show state-of-the-art performance in zero-shot settings on ScanNet and AI2THOR simulations, and successful real-world deployment on a Jackal robot, validating the practicality of combining spatial structure with commonsense knowledge for robust household object search.

Abstract

Object search is a fundamental skill for household robots, and its core difficulty is locating the target object accurately. Household environments are dynamic: users place everyday objects arbitrarily, which makes target localization challenging. To locate the target efficiently, the robot needs knowledge at both the object level and the room level. However, existing approaches rely on only one type of knowledge, leading to unsatisfactory localization performance and, consequently, inefficient object search. To address this problem, we propose a commonsense scene graph-based target localization method, CSG-TL, to enhance target object search in household environments. Given a pre-built map with stationary items, the robot combines room-level knowledge with object-level commonsense knowledge generated by a large language model (LLM) into a commonsense scene graph (CSG), which supplies both types of knowledge to CSG-TL. To demonstrate the advantage of CSG-TL on target localization, extensive experiments are performed on the real-world ScanNet dataset and the AI2THOR simulator. Moreover, we extend CSG-TL to an object search framework, CSG-OS, validated in both simulated and real-world environments. Code and videos are available at https://sites.google.com/view/csg-os.

Paper Structure

This paper contains 11 sections, 8 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: An overview of target object localization strategies, comparing human reasoning inspired by room-level and object-level commonsense with our CSG-TL method, which integrates a commonsense scene graph (CSG) with room-level and object-level knowledge for efficient target localization.
  • Figure 2: CSG-based object search (CSG-OS) pipeline. First, the user queries the target object, which is encoded with LLM-derived commonsense knowledge to form the target node $V_t$. Second, the CSG is constructed from a pre-built map of stationary items and incorporates $V_t$ for target localization through CSG-TL, detailed in Sec. \ref{sec:4-1}. Third, nodes correlated with the target are clustered based on their locations and the likelihood of their predicted correlations, establishing a set of candidate search points. Finally, the robot navigates to the candidate points in turn to search for the target. If the target is found, the task succeeds. Otherwise, the robot updates the CSG with newly detected objects and repeats the search steps until it finds the target or exceeds a preset threshold on the number of steps.
  • Figure 3: Illustration of the CSG construction process. The scene graph is initially created by identifying stationary items from the pre-built map based on spatial relationships, as defined in Eq. \ref{eq:1}. Subsequently, commonsense knowledge relevant to human inference of correlations is incorporated through LLM prompts.
  • Figure 4: Illustration of the CSG-TL structure. Given the CSG and a target node $V_t$, CSG-TL estimates the probability $p$ of a correlation between the target and each node within the CSG. A correlation is taken to exist if $p>0.5$, indicating a likely co-occurrence between the target and the respective object.
  • Figure 5: Visualization of target localization. Distinct markers indicate the top-1 candidate navigation point and the top-2 and top-3 candidates, respectively (calculation detailed in Sec. \ref{sec:4-2}). The only prior knowledge used is the pre-built map with stationary items. The target's actual location in the full 3D map is marked with a hollow red circle, and the corresponding location on the 2D fixtures-only map is marked with a red circle. This visualization compares the accuracy of our CSG-TL model in predicting the target's location against other methods.
  • ...and 1 more figure
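
The candidate-point step described in the Figure 2 and Figure 4 captions (keep nodes with $p>0.5$, then cluster them by location and predicted likelihood) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's own calculation is not given in this excerpt, so the greedy radius-based clustering, the `radius` value, the summed-probability score, and all names (`ScoredNode`, `candidate_points`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class ScoredNode:
    """Hypothetical CSG node: a 2D map position plus a predicted link probability."""
    name: str
    x: float
    y: float
    p: float  # predicted probability of correlation with the target


def candidate_points(nodes, p_threshold=0.5, radius=1.5):
    """Return candidate search points as (x, y, score) tuples, best first.

    `radius` (metres) and the summed-probability score are illustrative
    assumptions, not values or formulas taken from the paper.
    """
    # Keep only nodes predicted to be correlated with the target (p > 0.5),
    # visiting the most likely ones first so they seed the clusters.
    kept = sorted((n for n in nodes if n.p > p_threshold),
                  key=lambda n: n.p, reverse=True)

    clusters = []  # each cluster is a list of nodes; element 0 is its seed
    for node in kept:
        for cluster in clusters:
            seed = cluster[0]
            if hypot(node.x - seed.x, node.y - seed.y) <= radius:
                cluster.append(node)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([node])

    points = []
    for cluster in clusters:
        score = sum(n.p for n in cluster)
        # Probability-weighted centroid of the cluster.
        cx = sum(n.x * n.p for n in cluster) / score
        cy = sum(n.y * n.p for n in cluster) / score
        points.append((cx, cy, score))
    return sorted(points, key=lambda pt: pt[2], reverse=True)


# Toy scene with made-up positions and probabilities.
example_nodes = [
    ScoredNode("sofa", 1.0, 1.0, 0.9),
    ScoredNode("coffee table", 1.5, 1.0, 0.7),
    ScoredNode("sink", 6.0, 4.0, 0.6),
    ScoredNode("bed", 9.0, 9.0, 0.2),
]
for x, y, score in candidate_points(example_nodes):
    print(f"candidate at ({x:.2f}, {y:.2f}) with score {score:.2f}")
```

In this toy scene the sofa and coffee table fall into one cluster, the sink forms a second, and the bed is dropped because its probability is below the threshold. A robot following the captioned pipeline would visit the returned points in order and recompute them whenever newly detected objects are added to the graph.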