Map-based Modular Approach for Zero-shot Embodied Question Answering

Koya Sakamoto; Daichi Azuma; Taiki Miyanishi; Shuhei Kurita; Motoaki Kawanabe

TL;DR

This work tackles Embodied Question Answering (EQA) by addressing the limitations that end-to-end and simulation-based methods face in real-world settings. It proposes a map-based modular framework that integrates language-guided navigation with memory-informed VQA, leveraging foundation models for open-vocabulary reasoning. The method reaches competitive MP3D-EQA performance (VQA top-1 ≈ $0.43$) and is feasible in real-world settings, where it achieves notable success rates and its failures trace mainly to SLAM and object detection, which suggests that targeted improvements to those components would make it more robust. By combining semantic mapping, ITM verification, and VQA on memorized views, the approach offers a scalable path toward open-vocabulary, real-world EQA and may reduce the domain gap between simulation and reality.

Abstract

Embodied Question Answering (EQA) serves as a benchmark task to evaluate the capability of robots to navigate within novel environments and identify objects in response to human queries. However, existing EQA methods often rely on simulated environments and operate with limited vocabularies. This paper presents a map-based modular approach to EQA, enabling real-world robots to explore and map unknown environments. By leveraging foundation models, our method facilitates answering a diverse range of questions using natural language. We conducted extensive experiments in both virtual and real-world settings, demonstrating the robustness of our approach in navigating and comprehending queries within unknown environments.

Paper Structure (20 sections, 8 figures, 2 tables)


INTRODUCTION
RELATED WORK
Visual Question Answering in 3D Space
Language-Guided Object Goal Navigation
Embodied Referring Expression Comprehension
Question Answering for Embodied Agents
PROPOSED METHOD
Task Definition
Overview of EQA Framework
Language-guided Navigation Module
Image-text Matching Module
Visual Question Answering Module
EXPERIMENTS
EQA Datasets
Implementation Details
...and 5 more sections

Figures (8)

Figure 1: Example of our method: We provide an agent with a question and the agent proceeds to explore an unknown environment. When it encounters a potential target object, it verifies if this is indeed the correct target object through image-text matching (ITM). If the ITM score falls below a pre-determined threshold, the agent continues its exploration. If the ITM score exceeds the threshold, the agent stops exploration and performs VQA.
Figure 2: Map-based Modular Embodied Question Answering Model Overview. The proposed method comprises the Navigation module (outlined in blue) and the VQA module (outlined in red). The Navigation module consists of the Perception module and a set of Policies. The Perception module incrementally builds a 2D map, storing images along with their image-text matching scores. The Global Policy selects a long-term goal based on the 2D map and its frontiers. The Deterministic Local Policy outputs actions, and finally, the VQA module provides an answer based on the memorized images and the given question.
Figure 3: Dataset pre-processing using gpt-35-turbo-0613. It extracts a target object category from a given question for ObjNav and converts a question into a declarative text for image-text matching.
Figure 4: VQA top-1 accuracy on the MP3D-EQA 'train' split. LLaVA-v1.5-7b and LLaVA-v1.5-13b score higher than the other models.
Figure 5: ROC curves of image-text matching on the MP3D-EQA 'train' split.
...and 3 more figures
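
The control flow described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `explore_then_answer`, `MemorizedView`, `itm_score_fn`, `vqa_fn`, and the default threshold of 0.5 are hypothetical, the mapping and policy components are replaced by a plain iterable of frames, and answering from the best memorized views when no frame passes the threshold is an assumption that the captions do not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class MemorizedView:
    """One stored observation (hypothetical structure, not from the paper)."""
    image: object     # placeholder for an RGB frame
    itm_score: float  # image-text matching score for the declarative text


def explore_then_answer(
    frames: Iterable[object],
    declarative_text: str,
    question: str,
    itm_score_fn: Callable[[object, str], float],
    vqa_fn: Callable[[List[object], str], str],
    itm_threshold: float = 0.5,  # assumed value; the real threshold is not given here
) -> Tuple[Optional[str], List[MemorizedView]]:
    """Explore until a frame's ITM score exceeds the threshold, then run VQA."""
    memory: List[MemorizedView] = []
    for frame in frames:
        score = itm_score_fn(frame, declarative_text)
        memory.append(MemorizedView(frame, score))  # memorize image with its score
        if score > itm_threshold:
            break  # target verified: stop exploring
    if not memory:
        return None, memory
    # Assumption: if no frame passed the threshold, still answer from memory.
    ranked = sorted(memory, key=lambda view: view.itm_score, reverse=True)
    answer = vqa_fn([view.image for view in ranked], question)
    return answer, memory
```

In a real system, `frames` would come from the navigation module as the agent moves toward frontier goals, and `itm_score_fn` and `vqa_fn` would wrap the chosen foundation models. Passing them as callables keeps the sketch independent of any particular model API.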
