FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy

Haochen Zhang, Nirav Savaliya, Faizan Siddiqui, Enna Sachdeva

TL;DR

FAST-EQA addresses Embodied Question Answering by combining semantics-guided global and local exploration with a bounded, per-target visual memory and a Chain-of-Thought (CoT) reasoning module. The approach uses an LLM to extract candidate regions and targets, navigates with a doorway-aware frontier strategy and local region refinement, and retrieves a compact set of the most relevant observations per target for final QA with a VLM in a CoT framework. Empirical results show state-of-the-art performance on HM-EQA and EXPRESS-Bench, competitive results on OpenEQA and MT-HM3D, and improved real-time inference speed with a bounded memory footprint, enabling practical deployment on embodied agents. Limitations include reliance on current VLM spatial reasoning and variance in model reasoning, motivating future work on memory representations that compress and stabilize scene information while preserving reasoning quality.

Abstract

Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, and spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference time during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region-target hypotheses and updates them online, enabling robust handling of both single and multi-target questions without unbounded growth. To expand coverage efficiently, a global exploration policy treats narrow openings and doors as high-value frontiers, complementing local target seeking with minimal computation. Together, these components focus the agent's attention, improve scene coverage, and improve answer reliability while running substantially faster than prior approaches. On HM-EQA and EXPRESS-Bench, FAST-EQA achieves state-of-the-art performance, while performing competitively on OpenEQA and MT-HM3D.

Paper Structure (17 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: In the illustrated scenario, our FAST-EQA agent first localizes relevant regions, such as the master bedroom and guest bedroom, and identifies the visual target: bedsheet. Guided by a semantic-aware global exploration strategy focused on relevant rooms, it navigates across these regions while maintaining and updating a target-specific memory based on visual relevance. Once sufficiently confident, the agent queries a large vision–language model (here, GPT-4o) to answer the question using the stored visual observations.
  • Figure 2: FAST-EQA processes the question (Q) by extracting relevant regions (R) and visual targets (T). At each step, it localizes its current region ($\mathrm{R}_{\mathrm{t}}$) and updates a semantic memory. For each target $T_m$, it maintains a dedicated memory $\psi_m$ that is refined using a visual relevance score. The agent employs a semantic frontier–guided global exploration strategy, leveraging narrow passages (e.g., doors and hallways) to effectively search for relevant semantic regions. When a relevant region is reached, it switches to local exploration to refine the target-specific memory. Once the stopping criterion is satisfied, the agent queries a VLM to generate the final answer. Dotted lines indicate module inputs while solid lines indicate procedural direction of the system.
  • Figure 3: FAST-EQA leverages language-aligned features from AM-RADIO, together with a SigLIP adaptor, to direct the agent toward the target regions. For queries such as (a) Bathroom and (b) Bedroom, the predicted heatmaps are thresholded to produce white contour segments, while the red dot indicates the contour centroid that the agent steps towards. This visualization illustrates how semantic grounding enables precise localization of task-relevant areas in the environment to guide exploration from global to local.
  • Figure 4: FAST-EQA employs a bounded memory system that allocates a dedicated visual memory for each target, retaining the $k$ most relevant images (here, $k = 3$). The overall memory footprint scales only with the number of targets and remains constant over time, even in long-horizon tasks.
  • Figure 5: An example from EXPRESS-Bench illustrating how FAST-EQA identifies the relevant region $R$ and target $T$ from the question $Q$. It then explores the scene, and once the stopping condition is met, correctly generates the final answer from the retrieved visual memory.
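
The bounded per-target memory described in Figure 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TargetMemory`, the string frame identifiers, the numeric scores, and the second target `"pillow"` are all hypothetical, and the visual relevance score is assumed to be a single float supplied by some upstream model. The sketch only shows how keeping the $k$ highest-scoring observations per target makes the footprint depend on the number of targets rather than on the number of steps.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class TargetMemory:
    """Hypothetical bounded memory: keeps the k highest-scoring observations."""
    k: int = 3
    # Min-heap of (score, insertion_index, observation); the root is the
    # weakest entry currently kept, so it is the one to evict.
    _heap: list = field(default_factory=list)
    # Unique insertion index breaks score ties without comparing observations.
    _tiebreak: count = field(default_factory=count)

    def update(self, score: float, observation: str) -> None:
        entry = (score, next(self._tiebreak), observation)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # New observation beats the weakest stored one: swap it in.
            heapq.heapreplace(self._heap, entry)

    def retrieve(self) -> list:
        """Return stored observations, most relevant first."""
        return [obs for _, _, obs in sorted(self._heap, reverse=True)]


# One memory per target; "bedsheet" is the target from Figure 1,
# "pillow" is an assumed second target for illustration.
targets = ["bedsheet", "pillow"]
memory = {t: TargetMemory(k=3) for t in targets}

# Assumed (frame, relevance score) stream for the "bedsheet" target.
stream = [
    ("frame_00", 0.10),
    ("frame_01", 0.72),
    ("frame_02", 0.35),
    ("frame_03", 0.91),
    ("frame_04", 0.05),
]
for frame, score in stream:
    memory["bedsheet"].update(score, frame)

best = memory["bedsheet"].retrieve()
print(best)
```

With five frames observed and $k = 3$, only three frames remain stored for the target, so total storage stays at most $k$ times the number of targets however long the episode runs.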