Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering
Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta
TL;DR
This work tackles frontier oscillations in step-level exploration for embodied question answering by introducing Prune-Then-Plan, a calibration-based framework that uses the VLM solely to prune unlikely frontiers and defers the final decision to a coverage-based planner. At each step, it converts VLM confidences into p-values using empirical cumulative distribution functions (ECDFs) calibrated on bad-frontier scores, then applies Holm-Bonferroni-style pruning. The approach is integrated with the 3D-Mem EQA pipeline and demonstrates substantial gains in visually grounded navigation and answering metrics on OpenEQA and EXPRESS-Bench, including improved scene coverage and reduced path curvature under equal exploration budgets. Overall, step-level calibration provides a principled, interpretable mechanism for stabilizing exploration in VLM-driven EQA systems, with practical impact on reliable, efficient embodied reasoning.
Abstract
Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations: unstable back-and-forth movements caused by overconfidence and miscalibration, which lead to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni-inspired pruning procedure and then delegates the final decision to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements over baselines of up to 49% in visually grounded SPL and up to 33% in LLM-Match. Overall, our method achieves better scene coverage under equal exploration budgets on both the OpenEQA and EXPRESS-Bench datasets.
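The prune-then-plan step described above can be illustrated with a minimal Python sketch. It is one possible reading of the summary, not the authors' implementation. Several details are assumptions: the calibration set `bad_scores` is taken to hold VLM scores previously assigned to frontiers that humans judged bad, the p-value uses the standard `(1 + count) / (n + 1)` empirical form, a frontier is kept when Holm-Bonferroni rejects the hypothesis "this frontier is bad", all frontiers are kept if none survive, and `coverage_gains` stands in for whatever the coverage-based planner actually optimizes. All function names and the default `alpha` are hypothetical.

```python
from bisect import bisect_left


def ecdf_p_value(score, bad_scores_sorted):
    """Empirical p-value of `score` under the bad-frontier score distribution.

    Assumption: a small p-value means the score is unusually high for a bad
    frontier, so the frontier is plausibly worth visiting.
    """
    n = len(bad_scores_sorted)
    # Number of calibration scores greater than or equal to `score`.
    n_ge = n - bisect_left(bad_scores_sorted, score)
    return (1 + n_ge) / (n + 1)


def holm_keep(p_values, alpha=0.1):
    """Holm-Bonferroni step-down: return indices whose null ("bad") is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    kept = []
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is tested against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            kept.append(i)
        else:
            break  # step-down procedure stops at the first failure
    return sorted(kept)


def prune_then_plan(vlm_scores, coverage_gains, bad_scores, alpha=0.1):
    """Prune frontiers with calibrated p-values, then pick by coverage gain."""
    bad_sorted = sorted(bad_scores)
    p_values = [ecdf_p_value(s, bad_sorted) for s in vlm_scores]
    kept = holm_keep(p_values, alpha)
    if not kept:
        # Assumed fallback: if nothing survives pruning, consider every frontier.
        kept = list(range(len(vlm_scores)))
    # The planner, not the VLM, makes the final choice among surviving frontiers.
    return max(kept, key=lambda i: coverage_gains[i])
```

In this sketch the VLM never ranks the surviving frontiers against each other. Its scores only decide which candidates are removed, so small score fluctuations between steps cannot flip the chosen frontier unless they change the pruned set.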
