Cognitive Planning for Object Goal Navigation using Generative AI Models

Arjun P S, Andrew Melnik, Gora Chand Nandi

TL;DR

This work proposes a 3D modular scene representation, enriched with semantic descriptions, that enables a robot to navigate unfamiliar environments by leveraging LLMs and LVLMs to understand the semantic structure of the scene.

Abstract

Recent advancements in Generative AI, particularly in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), offer new possibilities for integrating cognitive planning into robotic systems. In this work, we present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies. Our approach enables a robot to navigate unfamiliar environments by leveraging LLMs and LVLMs to understand the semantic structure of the scene. To address the challenge of representing complex environments without overwhelming the system, we propose a 3D modular scene representation, enriched with semantic descriptions. This representation is dynamically pruned using an LLM-based mechanism, which filters irrelevant information and focuses on task-specific data. By combining these elements, our system generates high-level sub-goals that guide the exploration of the robot toward the target object. We validate our approach in simulated environments, demonstrating its ability to enhance object search efficiency while maintaining scalability in complex settings.

Paper Structure (23 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of our framework with the HomeRobot Stretch in the Habitat simulation environment. The robot in this episode is tasked with finding a pillow. After considering all the objects in the scene (the 3D modular scene representation), the agent decides to explore near the couch to find the pillow.
  • Figure 2: Architecture of the proposed pipeline. The agent explores the environment and collects observations (RGBD image and pose). An open-vocabulary segmentation module identifies the objects in the current frame. The pruner takes these detected segments and removes the unwanted ones. Each remaining segment is either initialized as a new node in the 3D scene representation or merged with an existing node, based on a similarity criterion. All new nodes are captioned with LLaVA to provide semantic information to the LLM-based planner. The agent then chooses a node to explore more closely in order to find the target object $G_o$. While doing so, the agent stores frame-wise information in the short-term memory module. If the agent decides that no object in the 3D scene representation is likely to have the target object $G_o$ near it, it continues to explore the scene and build the 3D scene representation.
  • Figure 3: Examples of frames retrieved from the short-term memory module in which the target object is detected. The top 8 frames are those in which the target object "orange" is visible, and the bottom 8 frames are those in which the target object "soda can" is visible.
  • Figure 4: Input object list given to the LLM and the corresponding pruned list. The pruner here uses GPT-3.5 Turbo.
  • ...and 2 more figures
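
The prune-then-merge step described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Segment`, `Node`, `prune`, and `merge_or_add` are hypothetical, the pruner is replaced by a fixed set of labels to keep (in the paper an LLM makes this decision), the captioner is a placeholder for LLaVA, and the similarity criterion is assumed to be "same label and centroids within a distance threshold", which the text above does not specify.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Point = Tuple[float, float, float]


@dataclass
class Segment:
    """One detected object in the current frame (hypothetical structure)."""
    label: str
    centroid: Point


@dataclass
class Node:
    """One object node in the 3D scene representation (hypothetical structure)."""
    label: str
    centroid: Point
    caption: str = ""
    count: int = 1  # number of segments merged into this node


def prune(segments: List[Segment], keep_labels: Set[str]) -> List[Segment]:
    # Stand-in for the LLM-based pruner: a fixed label set decides what is kept.
    return [s for s in segments if s.label in keep_labels]


def merge_or_add(
    nodes: List[Node],
    seg: Segment,
    max_dist: float = 0.5,  # assumed threshold in metres
    captioner: Callable[[Segment], str] = lambda s: f"a {s.label}",
) -> Node:
    # Assumed similarity criterion: same label and nearby centroid.
    for node in nodes:
        if node.label == seg.label and math.dist(node.centroid, seg.centroid) <= max_dist:
            n = node.count
            # Update the centroid as a running mean over merged segments.
            node.centroid = tuple(
                (c * n + s) / (n + 1) for c, s in zip(node.centroid, seg.centroid)
            )
            node.count += 1
            return node
    # No similar node found: create a new one and caption it once.
    node = Node(seg.label, seg.centroid, caption=captioner(seg))
    nodes.append(node)
    return node


# Two frames: the couch is seen twice, the ceiling is pruned away.
nodes: List[Node] = []
frame1 = [Segment("couch", (1.0, 0.0, 2.0)), Segment("ceiling", (0.0, 2.5, 0.0))]
frame2 = [Segment("couch", (1.2, 0.0, 2.0)), Segment("table", (4.0, 0.0, 1.0))]
for frame in (frame1, frame2):
    for seg in prune(frame, keep_labels={"couch", "table"}):
        merge_or_add(nodes, seg)

print([(n.label, n.count) for n in nodes])
```

In this toy run the two couch detections are merged into a single node and the table becomes a second node, so the planner would see two captioned candidates rather than four raw detections.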