GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Saumya Saxena, Blake Buchanan, Chris Paxton, Peiqi Liu, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer

TL;DR

GraphEQA tackles Embodied Question Answering by grounding a Vision-Language Model-based planner in a real-time, online 3D metric-semantic scene graph (3DSG) complemented by task-relevant visual memory. It constructs semantically enriched scene graphs with frontier-object links and room labels, and uses a hierarchical planner that reasons over rooms, regions, and objects, as well as semantically relevant frontiers, to guide exploration. In simulation on the HM-EQA and OpenEQA benchmarks, the approach achieves higher success rates with fewer planning steps than key baselines, and real-world experiments in indoor environments validate its practical viability. These results highlight the benefits of compact multimodal memory and real-time grounding for long-horizon robotics tasks. The work contributes a cohesive framework that integrates online 3DSG construction, semantic enrichment, memory, and hierarchical VLM planning, advancing open-world EQA and grounded robotic exploration.

Abstract

In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment to answer a situated question with confidence. This problem remains challenging in robotics due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps. We further demonstrate GraphEQA in multiple real-world home and office environments.
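
To make the idea of a hierarchical scene graph enriched with frontier-object links concrete, here is a minimal Python sketch. It is not the GraphEQA implementation: the `Node` and `SceneGraph` classes, their method names, and the 2.0 m linking radius are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Node:
    node_id: str
    label: str
    position: tuple  # (x, y, z) in meters


@dataclass
class SceneGraph:
    """Toy room -> object hierarchy plus frontier nodes (illustrative only)."""
    rooms: dict = field(default_factory=dict)
    objects: dict = field(default_factory=dict)
    frontiers: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)  # (parent_id, child_id) pairs

    def add_room(self, node):
        self.rooms[node.node_id] = node

    def add_object(self, node, room_id):
        # Each object hangs under the room that contains it.
        self.objects[node.node_id] = node
        self.edges.add((room_id, node.node_id))

    def add_frontier(self, node, radius=2.0):
        # Enrichment step: link the frontier to every object within `radius`.
        # The radius value is an assumption, not a number from the paper.
        self.frontiers[node.node_id] = node
        for obj in self.objects.values():
            if dist(obj.position, node.position) <= radius:
                self.edges.add((node.node_id, obj.node_id))

    def children(self, parent_id):
        # Labels of objects attached to a room or a frontier.
        return sorted(self.objects[c].label for p, c in self.edges if p == parent_id)


graph = SceneGraph()
graph.add_room(Node("r0", "kitchen", (0.0, 0.0, 0.0)))
graph.add_object(Node("o0", "sink", (1.0, 0.0, 0.0)), "r0")
graph.add_object(Node("o1", "sofa", (6.0, 0.0, 0.0)), "r0")
graph.add_frontier(Node("f0", "frontier", (2.0, 0.0, 0.0)))

print(graph.children("r0"))  # ['sink', 'sofa']
print(graph.children("f0"))  # ['sink']
```

Linking frontiers to nearby objects gives a language-based planner a semantic hint about what an unexplored region may contain, which is the intuition behind semantics-guided exploration.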

Paper Structure

This paper contains 39 sections, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Overview of GraphEQA: A novel approach for utilizing real-time 3D metric-semantic hierarchical scene graphs and task-relevant images as multimodal memory for grounding vision-language based planners to solve embodied question answering tasks in unseen environments.
  • Figure 2: Overall GraphEQA architecture. As the agent explores the environment, it uses its sensor data (RGBD images, semantic map, camera poses and intrinsics) to construct a 3D metric-semantic hierarchical scene graph (3DSG) as well as a 2D occupancy map for frontier selection in real time. The constructed 3DSG is enriched as discussed in \ref{enrichment}. From the set of images collected during each trajectory execution, a task-relevant subset is selected, called the task-relevant visual memory (\ref{keyframe}). A VLM-based planner (\ref{planner}) takes as input the enriched scene graph, task-relevant visual memory, a history of past states and actions, and the embodied question, and outputs the answer, its confidence in the selected answer, and the next step it needs to take in the environment. If the VLM agent is confident in its answer, the episode is terminated; otherwise the proposed action is executed in the environment and the process repeats.
  • Figure 3: VLM Planner Architecture. The Hierarchical Vision-Language planner takes as input the question, enriched scene graph, task-relevant visual memory, current state of the robot (position and room) and a history of past states, actions, answers and confidence values. The planner chooses the next <Goto_Object_node> action hierarchically by first selecting the room node and then the object node. The <Goto_Frontier_node> action is chosen based on the object nodes connected to the frontier via edges in the scene graph. The planner is asked to output a brief reasoning behind choosing each action, an answer, confidence in its answer, reasoning behind the answer and confidence, the next action, a brief description of the scene graph, and the visual memory.
  • Figure 4: Images from real-world experiments, deploying GraphEQA on the Hello Robot Stretch RE2 platform in two unique home environments (a, b). Each set of images is from the head camera on the Stretch robot, representing the top-K task-relevant images at each planning step as it constructs the scene graph and attempts to answer the question with high confidence. Provided under the images are planning step, answers, confidence, and explanations output from the VLM planner.
  • Figure 5:
  • ...and 12 more figures
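
The perceive, plan, and act cycle described in the captions of Figures 2 and 3 can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' code: `PlannerOutput`, `run_episode`, `ToyEnv`, `toy_planner`, and the `max_steps` budget are hypothetical names, and the planner here is a stub standing in for a real VLM call.

```python
from dataclasses import dataclass


@dataclass
class PlannerOutput:
    answer: str
    confident: bool
    # ("goto_object", room_id, object_id) or ("goto_frontier", frontier_id)
    action: tuple


def run_episode(question, env, planner, max_steps=10):
    """Repeat observe -> plan -> act until the planner is confident."""
    history = []
    for step in range(1, max_steps + 1):
        graph, images = env.observe()  # enriched scene graph + visual memory
        out = planner(question, graph, images, history)
        history.append(out)
        if out.confident:
            return out.answer, step  # terminate the episode
        env.execute(out.action)  # otherwise move and replan
    # Step budget exhausted: fall back to the last answer, if any.
    return (history[-1].answer if history else None), max_steps


class ToyEnv:
    """Hypothetical stand-in for a simulator or robot."""

    def __init__(self):
        self.executed = []

    def observe(self):
        graph = {"rooms": {"kitchen": ["sink"]}, "frontiers": {"f0": ["sofa"]}}
        images = [f"image_{len(self.executed)}"]
        return graph, images

    def execute(self, action):
        self.executed.append(action)


def toy_planner(question, graph, images, history):
    # Stub policy: explore the frontier once, then answer with confidence.
    if history:
        return PlannerOutput("yes", True, ())
    return PlannerOutput("unknown", False, ("goto_frontier", "f0"))


env = ToyEnv()
answer, steps = run_episode("Is there a sofa in the living room?", env, toy_planner)
print(answer, steps)  # yes 2
```

In a fuller version, the planner would first pick a room node and then an object node within it for a `goto_object` action, mirroring the hierarchical selection described for Figure 3.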