Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering

Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, Liang Lin

TL;DR

Embodied Question Answering (EQA) requires agents to actively explore 3D environments and ground answers in observations, but existing datasets introduce biases and fail to optimize exploration. The authors present EXPRESS-Bench, the largest exploration-aware EQA benchmark with 777 trajectories and 2,044 QA pairs, plus Fine-EQA, a two-stage exploration framework that blends frontier-based and goal-oriented strategies. They introduce the Exploration-Answer Consistency (EAC) metric, integrating per-step grounding scores $\boldsymbol{\sigma_i}$ and $\delta_i$ to jointly evaluate exploration quality and answer fidelity via $C$ and $E_{path}$. Extensive experiments show EXPRESS-Bench and Fine-EQA outperform prior baselines, delivering more faithful evaluations and improved navigation toward task-relevant information for robust reasoning.

Abstract

Embodied Question Answering (EQA) is a challenging task in embodied intelligence that requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. However, current EQA approaches suffer from critical limitations in exploration efficiency, dataset design, and evaluation metrics. Moreover, existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning, while frontier-based exploration strategies struggle in cluttered environments and fail to ensure fine-grained exploration of task-relevant areas. To address these challenges, we construct the EXPloration-awaRe Embodied queStion anSwering Benchmark (EXPRESS-Bench), the largest dataset designed specifically to evaluate both exploration and reasoning capabilities. EXPRESS-Bench consists of 777 exploration trajectories and 2,044 question-trajectory pairs. To improve exploration efficiency, we propose Fine-EQA, a hybrid exploration model that integrates frontier-based and goal-oriented navigation to guide agents toward task-relevant regions more effectively. Additionally, we introduce a novel evaluation metric, Exploration-Answer Consistency (EAC), which ensures faithful assessment by measuring the alignment between answer grounding and exploration reliability. Extensive experimental comparisons with state-of-the-art EQA models demonstrate the effectiveness of our EXPRESS-Bench in advancing embodied exploration and question reasoning.

Paper Structure

This paper contains 30 sections, 16 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Comparison of our EXPRESS-Bench with other EQA datasets. The orange trajectory in the top-down map shows a complete exploration path from EXPRESS-Bench, with observation images at key waypoints (top-right). Data for this path is in the orange box. The blue trajectory simulates OpenEQA's episodic memory, passing near the target but not ending there. The yellow box simulates how multiple-choice data is generated in HM-EQA, lacking the exploration path. For each question, answers are based on visual observations at the endpoint, scored according to each dataset’s evaluation method. Unlike HM-EQA and OpenEQA, which may give higher scores based on answer similarity, EXPRESS-Bench adjusts scores for incorrect or fabricated answers by grounding them in the agent’s observations.
  • Figure 2: The construction process of EXPRESS-Bench.
  • Figure 3: Overview of the EXPRESS-Bench statistics.
  • Figure 4: Exploration-Answer Consistency Metric.
  • Figure 5: The Fine-EQA framework operates as follows: The agent initially performs coarse-grained exploration using a frontier-based strategy, then switches to goal-oriented fine-grained exploration once task-relevant regions are identified. A maximum exploration limit per region prevents excessive searching, prompting the agent to either return to frontier-based exploration or focus on the next most promising region. Throughout this process, the VLM continuously evaluates the relevance and completeness of the acquired information, guiding the agent's decision to either continue exploration or generate answers based on the most recent visual inputs, as detailed in the Appendix.
  • ...and 13 more figures
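
The control flow described in the Figure 5 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, parameter, and default value here (`two_stage_explore`, `find_relevant_region`, `frontier_step`, `goal_step`, `ready_to_answer`, `generate_answer`, `max_steps`, `max_steps_per_region`) is hypothetical, and the perception, planning, and VLM components are stood in for by plain callbacks.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class ExplorationLog:
    """Records which exploration mode was used at each step."""
    modes: List[str] = field(default_factory=list)
    answer: Optional[str] = None


def two_stage_explore(
    find_relevant_region: Callable[[], Optional[str]],  # hypothetical: region id or None
    frontier_step: Callable[[], None],                  # hypothetical: one coarse step
    goal_step: Callable[[str], None],                   # hypothetical: one fine step in a region
    ready_to_answer: Callable[[], bool],                # hypothetical: VLM sufficiency check
    generate_answer: Callable[[], str],                 # hypothetical: VLM answer from latest views
    max_steps: int = 50,                                # assumed overall step budget
    max_steps_per_region: int = 5,                      # assumed per-region exploration limit
) -> ExplorationLog:
    log = ExplorationLog()
    region: Optional[str] = None
    steps_in_region = 0
    exhausted: Set[str] = set()

    for _ in range(max_steps):
        # Stop exploring once the gathered information is judged sufficient.
        if ready_to_answer():
            break

        # Without an active region, look for a task-relevant one not yet exhausted.
        if region is None:
            candidate = find_relevant_region()
            if candidate is not None and candidate not in exhausted:
                region = candidate
                steps_in_region = 0

        if region is not None:
            # Fine-grained, goal-oriented exploration inside the chosen region.
            goal_step(region)
            log.modes.append("goal")
            steps_in_region += 1
            if steps_in_region >= max_steps_per_region:
                # Per-region limit reached: release the region so the next
                # iteration picks another region or falls back to frontiers.
                exhausted.add(region)
                region = None
        else:
            # Coarse-grained, frontier-based exploration.
            frontier_step()
            log.modes.append("frontier")

    log.answer = generate_answer()
    return log
```

In this sketch the per-region budget is what keeps the agent from searching one area indefinitely: once it is spent, the region is marked as exhausted and the loop either selects a different candidate region or resumes frontier-based steps. How regions are actually scored and ranked is left to the callbacks, since the caption defers those details to the paper's Appendix.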