Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta

TL;DR

This work tackles frontier oscillations in step-level exploration for embodied question answering by introducing Prune-Then-Plan, a calibration-based framework that uses the VLM solely to prune unlikely frontiers and defers final decisions to a coverage-based planner. At each step, VLM confidences are converted into p-values using an empirical CDF (ECDF) calibrated on the scores of bad frontiers, and a Holm-Bonferroni-style procedure then prunes frontiers based on those p-values. The approach is integrated with the 3D-Mem EQA pipeline and reports gains in visually grounded navigation and answering metrics across OpenEQA and EXPRESS-Bench, including improved scene coverage and reduced path curvature under equal exploration budgets. Overall, step-level calibration provides a principled, interpretable mechanism to stabilize exploration in VLM-driven EQA systems, making embodied reasoning more reliable and efficient.

Abstract

Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations: unstable back-and-forth movements caused by overconfidence and miscalibration, which lead to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni-inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics, respectively, over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both the OpenEQA and EXPRESS-Bench datasets.

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our method stabilizes VLM-guided exploration by calibrating frontier choices at the step level. Rather than letting the VLM directly pick the ‘best’ frontier (as in 3D-Mem), we use the VLM only to flag and reject frontiers that are likely uninformative. Once pruning is complete, a coverage-based planner selects the next frontier from the remaining candidates. For instance, although the VLM might favor frontier ②, our calibration step rejects frontier ① as a bad option (since it leads the agent away from the kitchen) and then selects the closest viable frontier ③, enabling the agent to reach the correct, visually grounded answer more quickly. This separation of semantic pruning and coverage-based planning provides a principled balance, allowing the agent to explore efficiently while maintaining strong semantic relevance.
  • Figure 2: The agent traverses the scene and passes egocentric captures to the 3D-Mem world representation to update scene memory and compute frontiers. We then query the VLM to assess its confidence in each frontier’s potential to move the agent closer to a correct answer. The resulting confidences are converted into step-normalized scores and then into p-values via our empirical cumulative distribution function to support pruning. Finally, we employ multiple hypothesis testing to detect and prune bad frontiers, where $\alpha$ controls the aggressiveness of frontier pruning (larger $\alpha$ means more frontiers retained). The agent then proceeds towards the nearest surviving frontier and repeats the process.
  • Figure 3: Baseline methods with inefficient exploration strategies suffer from extreme failure cases, and this inefficiency can directly hurt answer quality. As seen in the top row, the efficiency of our method allows it to reach supporting visual evidence within the step budget, whereas 3D-Mem and Fine-EQA both fail because their inefficiencies prevent them from ever reaching the region of interest. In the second row, we see that even when 3D-Mem answers correctly, it can take much longer to do so. In the bottom row, we see a failure case where both baseline methods answer but do not provide answers that are visually grounded in the selected snapshots.
  • Figure 4: Mean observed voxels at step x for EXPRESS-Bench, obtained by averaging the number of voxels observed across all questions whose episodes reach that step.
  • Figure 5: Per-category question performance versus $\alpha$ on our EXPRESS-Bench tuning set. For certain categories, stricter or less strict filtering improves performance over the 3D-Mem baseline.
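
The per-step loop described in the Figure 2 caption (confidences, step-normalized scores, ECDF p-values, multiple-testing pruning, nearest survivor) can be illustrated with a minimal sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: step normalization is taken to be division by the sum of confidences within the step; the null hypothesis is taken to be "this frontier is bad", so rejecting it means keeping the frontier (which matches the caption's statement that larger $\alpha$ retains more frontiers); the p-value uses a +1 smoothed ECDF over calibration scores of frontiers labelled bad; and when nothing survives, the sketch falls back to keeping every frontier. All function names are hypothetical, and at least one frontier per step is assumed.

```python
from bisect import bisect_left


def step_normalize(confidences):
    """Assumed normalization: divide each confidence by the step total."""
    total = sum(confidences)
    if total <= 0:
        return [1.0 / len(confidences)] * len(confidences)
    return [c / total for c in confidences]


def ecdf_p_value(score, bad_scores_sorted):
    """P(bad-frontier score >= score), estimated from a sorted calibration set.

    The +1 terms keep the p-value strictly positive (an assumed smoothing).
    """
    n = len(bad_scores_sorted)
    n_ge = n - bisect_left(bad_scores_sorted, score)
    return (1 + n_ge) / (n + 1)


def holm_reject(p_values, alpha):
    """Holm step-down: return indices whose null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = set()
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected.add(i)
        else:
            break  # step-down stops at the first non-rejection
    return rejected


def choose_frontier(confidences, distances, bad_scores, alpha):
    """Prune with Holm on ECDF p-values, then pick the nearest survivor."""
    scores = step_normalize(confidences)
    calib = sorted(bad_scores)
    p_values = [ecdf_p_value(s, calib) for s in scores]
    survivors = holm_reject(p_values, alpha)  # rejected "bad" null => keep
    if not survivors:
        # Assumed fallback: if nothing is confidently "not bad", keep all.
        survivors = set(range(len(scores)))
    return min(survivors, key=lambda i: distances[i])


# Hypothetical calibration scores for frontiers labelled bad (not real data).
bad_scores = [(i + 0.5) / 400 for i in range(99)]

# Frontier 2 is the nearest overall, but its low score leaves it pruned,
# so the nearest surviving frontier (index 1) is chosen instead.
chosen = choose_frontier(
    confidences=[4.0, 5.0, 1.0],
    distances=[3.0, 1.5, 0.5],
    bad_scores=bad_scores,
    alpha=0.05,
)
print(chosen)
```

In this sketch the VLM never picks the destination directly: its scores only decide which frontiers remain eligible, and distance decides among them, mirroring the separation between semantic pruning and coverage-based planning that the captions describe.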