VeriGraph: Scene Graphs for Execution Verifiable Robot Planning

Daniel Ekpo, Mara Levy, Saksham Suri, Chuong Huynh, Abhinav Shrivastava

TL;DR

VeriGraph is a framework that integrates VLMs into robot task planning while verifying that each proposed action is feasible. It uses scene graphs as an intermediate representation, capturing key objects and spatial relationships, to verify and refine plans.

Abstract

Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 2: VeriGraph takes an initial scene image and a reference image, which may or may not come from the same setting, and generates a scene graph for each. With a VLM as the planner and verification that its proposed actions are executable, we generate a plan and execute it on the robot.
  • Figure 3: Overview of VeriGraph. Two images are input: the start scene (current state) and the goal scene (desired state). A scene graph generator extracts objects and relationships from each image, which are then processed by the iterative planning module. This module evaluates suggested actions from the VLM, checking for constraint violations. If a violation occurs, the VLM suggests a new action; if not, the action is executed. This loop continues until the environment matches the goal scene.
  • Figure 4: An example of how scene graphs are structured for individual images. First, a node is created for each object in the image; then edges are added to represent the relationships between objects. Each relationship is drawn as a solid line. Edges are directed toward the object that is "on" another object, so objects on top appear as leaf nodes, which nothing blocks from moving.
  • Figure 5: Iterative planning: The planner suggests the first action. Our model detects that the plate cannot be moved due to objects on top and requests a new plan. The planner responds with a better action, continuing until the task is complete.
  • Figure 6: Example scenes from the evaluation dataset: (top) blocks, (middle) kitchen, and (bottom) tableware scenes.
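
The captions for Figures 3 to 5 describe a loop in which a planner proposes an action, the action is checked against the scene graph, and a rejected action triggers a new proposal. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a scene graph can be reduced to a dictionary mapping each object to the object it rests on, and it replaces the VLM with a hypothetical rule-based stub (`naive_planner`). All names here (`ROOT`, `violation`, `plan_with_verification`, `naive_planner`) are illustrative assumptions.

```python
from typing import Callable, Optional

Action = tuple[str, str]     # (object to move, new support)
SceneGraph = dict[str, str]  # assumed encoding: object -> object it rests on
ROOT = "table"               # assumed root support; not itself a graph key


def objects_on_top(graph: SceneGraph, obj: str) -> list[str]:
    """Objects resting directly on `obj`; a leaf node has none."""
    return [o for o, support in graph.items() if support == obj]


def violation(graph: SceneGraph, action: Action) -> Optional[str]:
    """Return a reason the action is infeasible, or None if it is allowed."""
    obj, dest = action
    if obj not in graph:
        return f"{obj} is not in the scene"
    if dest == obj or (dest != ROOT and dest not in graph):
        return f"{dest} is not a valid destination for {obj}"
    blockers = objects_on_top(graph, obj)
    if blockers:
        return f"cannot move {obj}: {', '.join(blockers)} on top"
    return None


Planner = Callable[[SceneGraph, SceneGraph, list], Optional[Action]]


def plan_with_verification(start: SceneGraph, goal: SceneGraph,
                           propose: Planner, max_queries: int = 20) -> list[Action]:
    """Query the planner, keep only verified actions, stop at the goal graph."""
    graph = dict(start)
    plan: list[Action] = []
    rejected: list = []  # feedback about actions refused in the current state
    for _ in range(max_queries):
        if graph == goal:
            break
        action = propose(graph, goal, rejected)
        if action is None:
            break
        problem = violation(graph, action)
        if problem is not None:
            rejected.append((action, problem))  # ask for a different action
            continue
        obj, dest = action
        graph[obj] = dest  # simulate executing the verified action
        plan.append(action)
        rejected = []
    if graph != goal:
        raise RuntimeError("no verified plan found within the query budget")
    return plan


def naive_planner(graph: SceneGraph, goal: SceneGraph, rejected: list) -> Optional[Action]:
    """Hypothetical stand-in for the VLM: propose the first unmet goal relation."""
    tried = {action for action, _ in rejected}
    for obj, dest in goal.items():
        if graph.get(obj) != dest and (obj, dest) not in tried:
            return (obj, dest)
    return None


start = {"plate": "table", "cup": "plate", "tray": "table"}
goal = {"plate": "tray", "cup": "table", "tray": "table"}
plan = plan_with_verification(start, goal, naive_planner)
print(plan)  # [('cup', 'table'), ('plate', 'tray')]
```

In this toy scene the stub first proposes moving the plate, which is rejected because the cup rests on it, mirroring the situation in Figure 5. The stub then proposes moving the cup, after which the plate becomes a leaf node and can be moved. A real system would pass the rejection reason back to the VLM as feedback rather than filtering a fixed candidate list.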