VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps
Senthil Hariharan Arul, Dhruva Kumar, Vivek Sugirtharaj, Richard Kim, Xuewei Qi, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha
TL;DR
VLPG-Nav addresses object navigation in household settings with an additional requirement: centering the object in the robot's camera view. It introduces a Visual Language Pose Graph (VLPG) that stores visual-language (VL) embeddings tied to robot poses and uses clustering to generate informative initial viewpoints. An object-centering cost then refines the robot's pose, and a local search driven by an object localization probability map takes over when the object is not visible. The three core contributions are (i) VLPG-based initial viewpoint prediction, (ii) an object-centering formulation with orient and zoom costs, and (iii) a probability-map-guided local search that replans viewpoints to recover visibility under occlusion or displacement. Evaluated in simulation and real-world experiments, VLPG-Nav achieves improved SAE over baselines. The work demonstrates practical benefits for memory-constrained robots by leveraging environment priors and online observations to robustly locate and frame target objects, with potential impact on home-assistant robotics and related tasks.
Abstract
We present VLPG-Nav, a visual language navigation method for guiding robots to specified objects within household scenes. Unlike existing methods, which focus primarily on navigating the robot toward objects, our approach also addresses the challenge of centering the object within the robot's camera view. Our method builds a visual language pose graph (VLPG) that functions as a spatial map of VL embeddings. Given an open-vocabulary object query, we use the VLPG to plan a viewpoint for object navigation. Even after the robot reaches the viewpoint, real-world challenges such as object occlusion, object displacement, and the robot's localization error can prevent the object from being visible. To handle this, we build an object localization probability map that leverages the robot's current observations and the prior VLPG. When the object is not visible, the probability map is updated and an alternate viewpoint is computed. In addition, we propose an object-centering formulation that locally adjusts the robot's pose to center the object in the camera view. We evaluate our approach in simulation and real-world experiments, measuring its ability to view the object and center it within the camera field of view. VLPG-Nav demonstrates improved performance in locating the object, navigating around occlusions, and centering the object within the robot's camera view, outperforming the selected baselines on the evaluation metrics.
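To make the query-and-replan loop described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes each graph node holds a robot pose and a single VL embedding, that the query has already been embedded into the same space, and that a softmax over cosine similarities can stand in for the object localization probability map. The names `PoseGraph`, `node_probabilities`, `next_viewpoint`, and the `temperature` parameter are hypothetical, and the clustering step is omitted for brevity.

```python
import numpy as np


class PoseGraph:
    """Toy pose graph: each node pairs a robot pose with a VL embedding."""

    def __init__(self):
        self.poses = []        # (x, y, yaw) tuples
        self.embeddings = []   # unit-normalised embedding vectors

    def add_node(self, pose, embedding):
        vec = np.asarray(embedding, dtype=float)
        self.poses.append(tuple(pose))
        self.embeddings.append(vec / np.linalg.norm(vec))

    def similarity(self, query_embedding):
        """Cosine similarity between the query and every stored node."""
        query = np.asarray(query_embedding, dtype=float)
        query = query / np.linalg.norm(query)
        return np.stack(self.embeddings) @ query


def node_probabilities(graph, query_embedding, failed=(), temperature=0.1):
    """Assumed stand-in for the probability map: softmax over similarities,
    with nodes where the object was not seen (`failed`) ruled out."""
    sims = graph.similarity(query_embedding)
    weights = np.exp((sims - sims.max()) / temperature)
    weights[np.asarray(list(failed), dtype=int)] = 0.0
    total = weights.sum()
    if total == 0.0:
        raise ValueError("every node has been ruled out")
    return weights / total


def next_viewpoint(graph, query_embedding, failed=()):
    """Return the pose of the most probable remaining node."""
    probs = node_probabilities(graph, query_embedding, failed)
    return graph.poses[int(np.argmax(probs))]


if __name__ == "__main__":
    graph = PoseGraph()
    graph.add_node((0.0, 0.0, 0.0), [1.0, 0.0])
    graph.add_node((2.0, 0.0, 1.57), [0.0, 1.0])
    graph.add_node((1.0, 1.0, 0.78), [0.7, 0.7])
    query = [1.0, 0.1]  # placeholder for an embedded text query
    print(next_viewpoint(graph, query))              # initial viewpoint
    print(next_viewpoint(graph, query, failed=[0]))  # alternate viewpoint
```

In this toy setup, the first call selects the node whose embedding best matches the query, and marking that node as failed shifts the choice to the next most similar node, which mirrors the "update the map, compute an alternate viewpoint" behaviour at a very coarse level.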
