GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Saumya Saxena; Blake Buchanan; Chris Paxton; Peiqi Liu; Bingqing Chen; Narunas Vaskevicius; Luigi Palmieri; Jonathan Francis; Oliver Kroemer

TL;DR

GraphEQA tackles Embodied Question Answering (EQA) by grounding a Vision-Language Model (VLM) planner in a real-time, online 3D metric-semantic scene graph (3DSG) complemented by task-relevant visual memory. It constructs semantically enriched scene graphs with frontier-object links and room labels, and uses a hierarchical planner that reasons over rooms, regions, and objects, as well as semantically relevant frontiers, to guide exploration. In simulation on HM-EQA and OpenEQA, the approach achieves higher success rates with fewer planning steps than key baselines, and it is also demonstrated in real-world home and office environments, showing the benefit of compact multimodal memory and real-time grounding for long-horizon robotics tasks. The work contributes a framework that integrates online 3DSG construction, semantic enrichment, memory, and hierarchical VLM planning for open-world EQA and grounded robotic exploration.

Abstract

In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment to answer a situated question with confidence. This problem remains challenging in robotics due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps. We further demonstrate GraphEQA in multiple real-world home and office environments.
