VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Senthil Hariharan Arul, Dhruva Kumar, Vivek Sugirtharaj, Richard Kim, Xuewei Qi, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

TL;DR

VLPG-Nav addresses the problem of object navigation in household settings with an additional requirement: centering the object in the robot's camera view. It introduces a Visual Language Pose Graph (VLPG) to store VL embeddings tied to robot poses and uses clustering to generate informative initial viewpoints, followed by an object-centering cost to refine pose and a local search driven by an object localization probability map when the object is not visible. The three core contributions are (i) VLPG-based initial viewpoint prediction, (ii) an object-centering formulation with orient and zoom costs, and (iii) a probability-map-guided local search that replans viewpoints to recover visibility under occlusion or displacement; evaluated in simulation and real-world experiments, VLPG-Nav achieves improved SAE over baselines. The work demonstrates practical benefits for memory-constrained robots by leveraging environment priors and online observations to robustly locate and frame target objects, with potential impact on home-assistant robotics and related tasks.
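As a rough illustration of what an object-centering cost with an orientation term and a zoom term could look like, the sketch below scores a detected bounding box against the image frame. This is not the paper's formulation: the function name `centering_cost`, the quadratic form of both terms, the `target_fill` parameter, and the weights are all assumptions made for this example.

```python
def centering_cost(bbox, image_size, target_fill=0.5, w_orient=1.0, w_zoom=1.0):
    """Hypothetical centering cost: 0 when the box is centered and fills
    `target_fill` of the frame; grows as the box drifts sideways or is too
    small or too large. All terms are illustrative assumptions.

    bbox: (x_min, y_min, x_max, y_max) in pixels.
    image_size: (width, height) in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size

    # Orientation term: horizontal offset of the box center from the image
    # center, normalized to [-1, 1] and squared.
    center_x = 0.5 * (x_min + x_max)
    orient = ((center_x - width / 2.0) / (width / 2.0)) ** 2

    # Zoom term: how far the box's largest relative extent is from the
    # desired fill fraction of the frame.
    fill = max((x_max - x_min) / width, (y_max - y_min) / height)
    zoom = (fill - target_fill) ** 2

    return w_orient * orient + w_zoom * zoom
```

Under these assumptions, a robot could rotate to reduce the orientation term and move toward or away from the object to reduce the zoom term; how the paper actually couples the two terms to the robot's pose is not specified in this summary.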

Abstract

We present VLPG-Nav, a visual language navigation method for guiding robots to specified objects within household scenes. Unlike existing methods, which primarily focus on navigating the robot toward objects, our approach considers the additional challenge of centering the object within the robot's camera view. Our method builds a visual language pose graph (VLPG) that functions as a spatial map of VL embeddings. Given an open-vocabulary object query, we plan a viewpoint for object navigation using the VLPG. Even after the robot reaches the viewpoint, real-world challenges such as object occlusion, object displacement, and the robot's localization error can prevent the object from being visible. We therefore build an object localization probability map that leverages the robot's current observations and the prior VLPG. When the object isn't visible, the probability map is updated and an alternate viewpoint is computed. In addition, we propose an object-centering formulation that locally adjusts the robot's pose to center the object in the camera view. We evaluate our approach in simulation and real-world experiments, measuring its ability to view and center the object within the camera field of view. VLPG-Nav demonstrates improved performance in locating the object, navigating around occlusions, and centering the object within the robot's camera view, outperforming the selected baselines on the evaluation metrics.
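The idea of a pose graph that stores a VL embedding per robot pose, queried with an open-vocabulary prompt and then clustered into viewpoints, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: embeddings are assumed to be precomputed vectors in a shared image-text space, cosine similarity and a greedy radius-based clustering are stand-ins for whatever scoring and clustering the paper uses, and the names `rank_poses`, `cluster_poses`, and `initial_viewpoint` are hypothetical.

```python
import numpy as np

def rank_poses(node_embeddings, query_embedding):
    """Cosine similarity between each pose node's embedding and the query."""
    nodes = node_embeddings / np.linalg.norm(node_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    return nodes @ query

def cluster_poses(poses, radius):
    """Greedy clustering (an assumption): a pose joins the first cluster
    whose seed pose lies within `radius` in the x-y plane."""
    clusters = []
    for i, pose in enumerate(poses):
        for cluster in clusters:
            seed = poses[cluster[0]]
            if np.linalg.norm(pose[:2] - seed[:2]) <= radius:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def initial_viewpoint(poses, node_embeddings, query_embedding, top_k=3, radius=1.0):
    """Pick a viewpoint (x, y, yaw) from the best-scoring cluster of poses."""
    scores = rank_poses(node_embeddings, query_embedding)
    top = np.argsort(scores)[::-1][:top_k]       # indices of the top-k nodes
    clusters = cluster_poses(poses[top], radius)  # clusters index into `top`
    best = max(clusters, key=lambda c: scores[top[c]].mean())
    members = top[best]
    return poses[members[np.argmax(scores[members])]]

# Toy pose graph: four nodes with (x, y, yaw) and 2-D stand-in embeddings.
poses = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.1], [5.0, 5.0, 1.57], [5.2, 5.0, 1.6]])
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
viewpoint = initial_viewpoint(poses, embeddings, np.array([0.0, 1.0]))
```

Because only one embedding per pose is stored, the memory footprint of such a structure grows with the number of pose-graph nodes rather than with map area, which is consistent with the summary's claim that the approach suits memory-constrained robots.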

Paper Structure

This paper contains 23 sections, 15 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: (Fig. 1-a) The overall framework of VLPG-Nav. Based on the object-related prompt, an initial viewpoint is computed using the visual language pose graph (VLPG) and clustering. Subsequently, we center the object in the camera image through our proposed object-centering cost function. When the object is not in view at the chosen viewpoint, we replan using an object localization probability map constructed using the local occupancy map and VLPG. (Fig. 1-b) A real-world example where the robot is tasked to view a plant. The figure shows the camera view from the initial viewpoint, and the robot successfully views the plant. (Fig. 1-c) In this case, we obstruct the initial viewpoint with an obstacle. The figure shows the obstructed camera view from the initial viewpoint. (Fig. 1-d) Since the object is occluded, the local search identifies an alternative viewpoint, and the robot obtains an unobstructed view of the plant.
  • Figure 2: Local Search: We illustrate a scenario in which the robot is tasked with viewing an oven. (Fig. 2-a) The initial viewpoint guess results in the robot's view of the oven being occluded. (Fig. 2-b) Based on the viewpoint cluster obtained from VLPG and clustering, we compute a probability map for localizing the oven. Obstacles are shown in black; regions with higher-intensity blue are more likely to contain the object. (Fig. 2-c) Using the probability map, we sample a set of viewpoints, from which a suitable alternative viewpoint is chosen based on our optimization objective.
  • Figure 3: The red arrow shows the viewpoint computed as an initial guess, while the green arrow is the robot's pose at the end state. We can observe that the object centering directs the robot to the object of interest.
  • Figure 4: An example demonstrating "Local Search", where the robot is assigned the task of viewing an oven. (Fig. 4-a) The initial viewpoint guess results in the robot's view being occluded. (Fig. 4-b) The local search identifies a suitable alternative viewpoint (red), which offers a better view of the object. (Fig. 4-c) The robot then moves to the replanned viewpoint and successfully views the object.
  • Figure 5: An illustrative example of viewpoint generation using prior environment knowledge. (Fig. 5-a) The environment's floor plan. (Fig. 5-b) Ground truth locations of the objects of interest on the 2D floor plan. (Fig. 5-c) The 2D object localization computed using VLMap [vlmap]. (Fig. 5-d) The object viewpoints and camera FoV generated by our approach. In this case, the walls (depicted in black) are from the occupancy map and are not computed using the VLPG. (Fig. 5-e) The objects of interest and the colors used for them on the map. The proposed method improves object localization, and consequently object goal navigation, since object viewpoints are linked to the SLAM pose graph, which gets optimized over time. Computationally, VLPG is memory efficient and does not require depth maps for 2D projection, making it effective for deployment on low-compute robots.
  • ...and 1 more figure
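The local search described in the captions above (build a probability map from the VLPG viewpoint cluster and the occupancy map, update it when the object is not visible, then choose an alternative viewpoint) might be sketched as follows. Everything specific here is an assumption for illustration: the Gaussian placed a fixed `look_ahead` distance in front of each stored viewpoint, the zeroing of observed-empty cells, the ring of candidate viewpoints at a fixed `standoff`, and the travel-distance-only cost. The paper's actual map construction and optimization objective are not reproduced in this summary.

```python
import numpy as np

def prior_map(shape, viewpoints, weights, look_ahead=1.0, sigma=0.5, resolution=0.1):
    """Assumed prior: one Gaussian per stored viewpoint, centered
    `look_ahead` meters along that viewpoint's heading."""
    rows, cols = np.indices(shape)
    xs, ys = cols * resolution, rows * resolution
    prob = np.zeros(shape)
    for (x, y, yaw), w in zip(viewpoints, weights):
        cx, cy = x + look_ahead * np.cos(yaw), y + look_ahead * np.sin(yaw)
        prob += w * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return prob

def update_map(prob, occupied, observed_empty):
    """Zero out obstacle cells and cells already seen without the object,
    then renormalize so the map sums to 1 (if any mass remains)."""
    prob = prob.copy()
    prob[occupied | observed_empty] = 0.0
    total = prob.sum()
    return prob / total if total > 0 else prob

def replan_viewpoint(prob, occupied, robot_xy, standoff=1.0,
                     n_candidates=16, resolution=0.1):
    """Sample viewpoints on a ring around the most likely cell, skip those
    in obstacles or off the map, and return the one nearest the robot as
    (x, y, yaw) facing the target. Returns None if no candidate is free."""
    robot_xy = np.asarray(robot_xy, dtype=float)
    r, c = np.unravel_index(np.argmax(prob), prob.shape)
    target = np.array([c, r]) * resolution
    best, best_cost = None, np.inf
    for angle in np.linspace(0.0, 2.0 * np.pi, n_candidates, endpoint=False):
        xy = target + standoff * np.array([np.cos(angle), np.sin(angle)])
        col, row = np.round(xy / resolution).astype(int)
        if not (0 <= row < prob.shape[0] and 0 <= col < prob.shape[1]):
            continue
        if occupied[row, col]:
            continue
        cost = np.linalg.norm(xy - robot_xy)  # assumed cost: travel distance
        if cost < best_cost:
            yaw = np.arctan2(target[1] - xy[1], target[0] - xy[0])
            best, best_cost = (float(xy[0]), float(xy[1]), float(yaw)), cost
    return best
```

A fuller version would also check line of sight from each candidate to the target through the occupancy map; that step is omitted here to keep the sketch short.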