Map-based Modular Approach for Zero-shot Embodied Question Answering

Koya Sakamoto, Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, Motoaki Kawanabe

TL;DR

This work tackles Embodied Question Answering (EQA), targeting the limitations of end-to-end and simulation-based methods in real-world settings. It proposes a map-based modular framework that integrates language-guided navigation with memory-informed VQA, using foundation models for open-vocabulary reasoning. The method reaches competitive MP3D-EQA performance (VQA top-1 ≈ $0.43$) and proves feasible in real-world experiments, where the observed failures trace back to SLAM and object detection, so targeted improvements to those components should raise robustness. By combining semantic mapping, image-text matching (ITM) verification, and VQA on memorized views, the approach offers a scalable path toward open-vocabulary, real-world EQA and may reduce the domain gap between simulation and reality.

Abstract

Embodied Question Answering (EQA) serves as a benchmark task to evaluate the capability of robots to navigate within novel environments and identify objects in response to human queries. However, existing EQA methods often rely on simulated environments and operate with limited vocabularies. This paper presents a map-based modular approach to EQA, enabling real-world robots to explore and map unknown environments. By leveraging foundation models, our method facilitates answering a diverse range of questions using natural language. We conducted extensive experiments in both virtual and real-world settings, demonstrating the robustness of our approach in navigating and comprehending queries within unknown environments.

Paper Structure (20 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Example of our method: We provide an agent with a question and the agent proceeds to explore an unknown environment. When it encounters a potential target object, it verifies if this is indeed the correct target object through image-text matching (ITM). If the ITM score falls below a pre-determined threshold, the agent continues its exploration. If the ITM score exceeds the threshold, the agent stops exploration and performs VQA.
  • Figure 2: Map-based Modular Embodied Question Answering Model Overview. The proposed method comprises the Navigation module (outlined in blue) and the VQA module (outlined in red). The Navigation module consists of the Perception module and a set of Policies. The Perception module incrementally builds a 2D map, storing images along with their image-text matching scores. The Global Policy selects a long-term goal based on the 2D map and its frontiers. The Deterministic Local Policy outputs actions, and finally, the VQA module provides an answer based on the memorized images and the given question.
  • Figure 3: Dataset pre-processing using gpt-35-turbo-0613. It extracts a target object category from a given question for ObjNav and converts a question into a declarative text for image-text matching.
  • Figure 4: VQA top-1 accuracy on the MP3D-EQA 'train' split. LLaVA-v1.5-7b and LLaVA-v1.5-13b score higher than the other models.
  • Figure 5: ROC curves of image-text matching on the 'train' split of MP3D-EQA (EqaMatterport).
  • ...and 3 more figures
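
The stopping rule described in the Figure 1 and Figure 2 captions can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `explore_then_answer`, the callables `itm_score` and `vqa`, the default threshold of 0.5, and the choice to hand the VQA step the memorized views sorted by ITM score are all hypothetical. The captions state only that images are stored with their ITM scores, that exploration stops once the score exceeds a pre-determined threshold, and that VQA runs on the memorized images.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def explore_then_answer(
    views: Iterable[str],
    itm_score: Callable[[str, str], float],
    vqa: Callable[[List[str], str], str],
    question: str,
    declarative_text: str,
    threshold: float = 0.5,  # assumed value; the source only says "pre-determined"
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Explore view by view, stop when ITM exceeds the threshold, then run VQA.

    Returns (answer, memory). The answer is None if no view passed the threshold.
    """
    memory: List[Tuple[str, float]] = []  # (view, ITM score) pairs kept during exploration
    for view in views:
        score = itm_score(view, declarative_text)
        memory.append((view, score))
        if score > threshold:
            # Assumption: pass memorized views to VQA, best-matching first.
            ranked = [v for v, _ in sorted(memory, key=lambda p: p[1], reverse=True)]
            return vqa(ranked, question), memory
        # Score at or below the threshold: keep exploring.
    return None, memory
```

In a real system, `views` would come from the navigation policies and `itm_score` and `vqa` would wrap foundation models. Here they are plain callables so the control flow can be checked in isolation with stub functions.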