Map-based Modular Approach for Zero-shot Embodied Question Answering
Koya Sakamoto, Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, Motoaki Kawanabe
TL;DR
This work tackles Embodied Question Answering (EQA) by addressing the limitations that end-to-end and simulation-based methods face in real-world settings. It proposes a map-based modular framework that couples language-guided navigation with memory-informed visual question answering (VQA), using foundation models for open-vocabulary reasoning. The method reaches competitive performance on MP3D-EQA (VQA top-1 ≈ $0.43$) and proves feasible on a real robot, where failures trace mainly to SLAM and object detection errors, which indicates where targeted improvements are needed. By combining semantic mapping, image-text matching (ITM) verification, and VQA on memorized views, the approach offers a scalable path toward open-vocabulary, real-world EQA and may reduce the domain gap between simulation and reality.
Abstract
Embodied Question Answering (EQA) serves as a benchmark task to evaluate the capability of robots to navigate within novel environments and identify objects in response to human queries. However, existing EQA methods often rely on simulated environments and operate with limited vocabularies. This paper presents a map-based modular approach to EQA, enabling real-world robots to explore and map unknown environments. By leveraging foundation models, our method facilitates answering a diverse range of questions using natural language. We conducted extensive experiments in both virtual and real-world settings, demonstrating the robustness of our approach in navigating and comprehending queries within unknown environments.
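To make the modular structure concrete, the following is a minimal Python sketch of the control flow described above: explore while memorizing views, verify the target with an image-text matching score, then run VQA on the best memorized view. It is an illustration under stated assumptions, not the authors' implementation. The names `View`, `answer_question`, `itm_score`, `vqa`, and `itm_threshold` are hypothetical, the exploration policy is reduced to an iterable of views, and the target phrase is assumed to have been extracted from the question beforehand.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class View:
    """A memorized observation (hypothetical structure)."""
    pose: Tuple[float, float]  # 2D map position where the view was captured
    image: str                 # stand-in for an RGB frame


def answer_question(
    question: str,
    target: str,
    views: Iterable[View],
    itm_score: Callable[[str, str], float],
    vqa: Callable[[str, str], str],
    itm_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Optional[str]:
    """Explore, verify the target with ITM, then answer from memory."""
    memory: List[Tuple[float, View]] = []

    # Exploration: memorize each view with its ITM score for the target.
    for view in views:
        score = itm_score(view.image, target)
        memory.append((score, view))
        if score >= itm_threshold:
            break  # target verified, stop exploring

    if not memory:
        return None  # nothing was observed

    # Answer from the memorized view that best matches the target.
    best_score, best_view = max(memory, key=lambda pair: pair[0])
    if best_score < itm_threshold:
        return None  # target never verified, so no answer is given
    return vqa(best_view.image, question)
```

In a real system, `itm_score` and `vqa` would wrap foundation models and `views` would come from a navigation policy driven by the semantic map; here they are plain callables so the control flow can be read and tested in isolation.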
