Explore until Confident: Efficient Exploration for Embodied Question Answering

Allen Z. Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, Dorsa Sadigh

TL;DR

This work tackles Embodied Question Answering by integrating a vision-language model with an external semantic map to guide targeted exploration, and by applying multi-step conformal prediction to calibrate stopping decisions. It introduces semantic-value weighting (LSV and GSV) derived from VLM prompts to steer exploration toward informative regions, and a CP-based framework to guarantee calibrated confidence when deciding to stop. The HM-EQA dataset based on HM3D enables realistic simulation and hardware experiments, showing that semantic prompting plus calibrated stopping reduces interaction steps while maintaining high answer accuracy. Overall, the approach demonstrates that coupling semantic reasoning with uncertainty-aware stopping significantly improves EQA efficiency and reliability in diverse environments.

Abstract

We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene to be able to plan how to explore over time, and their confidence can be miscalibrated and can cause the robot to prematurely stop exploration or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM - leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question answering confidence, allowing the robot to know when to stop exploration - leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show our proposed approach improves the performance and efficiency over baselines that do not leverage a VLM for exploration or do not calibrate its confidence. Webpage with experiment videos and code: https://explore-eqa.github.io/
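
The calibrated stopping rule mentioned in the abstract can be illustrated with a minimal sketch. It assumes standard split conformal prediction with the nonconformity score 1 - p(true answer), and a plain intersection of per-step prediction sets. The function names, the score choice, and the toy probabilities are all hypothetical; the paper's multi-step calibration may differ, and this simplified version should not be read as carrying the paper's guarantee.

```python
import math


def conformal_threshold(cal_true_probs, epsilon):
    """Conformal quantile of the scores 1 - p(true answer) on calibration data.

    cal_true_probs: VLM probability assigned to the true answer per calibration case.
    epsilon: user-specified error level (target coverage is 1 - epsilon).
    """
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    rank = math.ceil((n + 1) * (1.0 - epsilon))
    if rank > n:
        # Too few calibration points for this epsilon: keep every answer.
        return 1.0
    return scores[rank - 1]


def prediction_set(answer_probs, q_hat):
    """Keep each answer whose score 1 - p does not exceed the threshold."""
    return {a for a, p in answer_probs.items() if 1.0 - p <= q_hat}


def explore_until_confident(step_probs, q_hat):
    """Intersect per-step prediction sets; stop when one answer remains."""
    remaining = None
    for step, probs in enumerate(step_probs, start=1):
        current = prediction_set(probs, q_hat)
        remaining = current if remaining is None else remaining & current
        if len(remaining) == 1:
            return step, next(iter(remaining))
    return None, remaining


# Toy values (assumed, not from the paper).
q_hat = conformal_threshold([0.5, 0.25, 0.75, 0.5], epsilon=0.25)
steps = [
    {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25},
    {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125},
    {"A": 0.75, "B": 0.125, "C": 0.0625, "D": 0.0625},
]
print(explore_until_confident(steps, q_hat))  # prints (3, 'A')
```

In this toy run the set shrinks from four answers to two and then to one, so the agent stops at the third step. A smaller `epsilon` raises the threshold, which keeps more answers in each set and delays stopping.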

Paper Structure (28 sections, 12 equations, 20 figures)

Figures (20)

  • Figure 1: Given a question about the scene ("Is the dishwasher in the kitchen open? A) Yes B) No"), our framework leverages a large vision-language model (VLM) to obtain semantic information from the views (visualized by overlaying it on top of the occupancy map), which guides a Fetch robot to explore relevant locations. Using such a semantic map helps the robot explore more efficiently than frontier-based exploration without any semantic value (FBE; see the experiments section). The robot maintains a set of possible answers and stops when the set reduces to a single answer based on the current view. In this example, the robot is confident at Step 16, where it sees the open dishwasher not too far from its position. The robot paths (thin lines) are approximated.
  • Figure 2: Overview of our framework for EQA tasks, which combines a VLM and an external semantic map for planning.
  • Figure 3: To query the VLM's uncertainty over possible exploration locations, we visually prompt the VLM with possible points in the current view (left column) and also with the entire view (middle column) to obtain the Local Semantic Value (LSV) and Global Semantic Value (GSV) (see the semantic-value subsection). A weighted combination of them (SV) is saved in a semantic map (right column). The values are used as the weights for sampling the next frontier to navigate to, guiding the robot towards unknown and relevant regions (see the semantic frontier-based exploration subsection).
  • Figure 4: For determining when to stop, we apply a principled approach based on multiple-step conformal prediction: at each step, a prediction set is generated, and the robot keeps the intersection of the sets until there is only one option remaining. The correct answer is guaranteed to be the remaining one with user-specified probability.
  • Figure 5: Sample scenarios from the HM-EQA dataset. The images are the views used by the robot to determine the final answer in our experiments. The boxes show the questions with the true answers bolded.
  • ...and 15 more figures
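
The frontier-sampling step described in the Figure 3 caption can be sketched as follows. The caption only says that LSV and GSV are combined with weights and that the result is used to sample the next frontier. The convex combination, the `alpha` parameter, and the dictionary layout below are assumptions made for illustration, not the paper's exact formula.

```python
import random


def semantic_value(lsv, gsv, alpha=0.5):
    """Hypothetical weighted combination of local and global semantic values."""
    return alpha * lsv + (1.0 - alpha) * gsv


def sample_frontier(frontiers, rng, alpha=0.5):
    """Sample the next frontier with probability proportional to its semantic value.

    frontiers: list of dicts with keys "name", "lsv", "gsv" (assumed layout).
    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    weights = [semantic_value(f["lsv"], f["gsv"], alpha) for f in frontiers]
    if sum(weights) <= 0.0:
        # No semantic signal at all: fall back to uniform frontier sampling.
        return rng.choice(frontiers)
    return rng.choices(frontiers, weights=weights, k=1)[0]


# Toy frontiers (values are made up for the example).
frontiers = [
    {"name": "hallway", "lsv": 0.0, "gsv": 0.0},
    {"name": "kitchen door", "lsv": 0.8, "gsv": 0.6},
]
rng = random.Random(0)
print(sample_frontier(frontiers, rng)["name"])  # prints kitchen door
```

Sampling in proportion to the weights, rather than always taking the maximum, keeps some chance of visiting lower-valued frontiers, which matches the caption's description of the values being used as sampling weights.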
