Active Test-time Vision-Language Navigation

Heeju Ko; Sungjune Kim; Gyeongrok Oh; Jeongyoon Yoon; Honglak Lee; Sujin Jang; Seungryong Kim; Sangpil Kim

TL;DR

The paper addresses vision-language navigation under distribution shifts at test time by introducing ATENA, a framework that integrates active human feedback with self-supervised signals to calibrate uncertainty during online navigation. ATENA combines Mixture Entropy Optimization, which conditions entropy on episodic outcomes using a mixture of the policy and a pseudo-expert distribution, with Self-Active Learning, guided by an uncertainty-based query strategy and a self-prediction head. The approach updates the policy with a joint objective that balances feedback-driven and self-supervised signals, and selectively queries feedback to maximize information while minimizing labeling cost. Empirical results on REVERIE, R2R, and R2R-CE show that ATENA outperforms strong baselines and existing TTA methods, improving robustness to test-time distribution shifts and uncertainty calibration. Overall, the work provides a scalable framework for interactive and autonomous test-time adaptation in VLN and related embodied AI tasks.

Abstract

Vision-Language Navigation (VLN) policies trained on offline datasets often exhibit degraded task performance when deployed in unfamiliar navigation environments at test time, where agents are typically evaluated without access to external interaction or feedback. Entropy minimization has emerged as a practical solution for reducing prediction uncertainty at test time; however, it can suffer from accumulated errors, as agents may become overconfident in incorrect actions without sufficient contextual grounding. To tackle these challenges, we introduce ATENA (Active TEst-time Navigation Agent), a test-time active learning framework that enables practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. Here, we propose mixture entropy optimization, where entropy is obtained from a combination of the action and pseudo-expert distributions (the latter being a hypothetical action distribution that assumes the agent's selected action is optimal), controlling both prediction confidence and action preference. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions. As a result, the agent stays actively engaged throughout all iterations, leading to well-grounded and adaptive decision-making. Extensive evaluations on challenging VLN benchmarks (REVERIE, R2R, and R2R-CE) demonstrate that ATENA successfully overcomes distributional shifts at test time, outperforming the compared baseline methods across various settings.
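
To make the mixture entropy idea more concrete, the sketch below shows one possible reading of it in plain Python. It is an illustration under stated assumptions, not the paper's formulation: the pseudo-expert is modelled as a one-hot distribution on the selected action, the mixing weight `lam` is a hypothetical parameter, and the sign convention (minimize entropy after a success, maximize it after a failure) is inferred from the abstract's description.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mixture_distribution(action_probs, selected, lam=0.5):
    """Blend the policy's action distribution with a one-hot pseudo-expert.

    Assumption: the pseudo-expert puts all its mass on the action the agent
    selected. `lam` is a hypothetical mixing weight for the pseudo-expert.
    """
    return [
        (1.0 - lam) * p + (lam if i == selected else 0.0)
        for i, p in enumerate(action_probs)
    ]


def mixture_entropy_loss(step_probs, selected_actions, success, lam=0.5):
    """Episode-level loss: mean mixture entropy, signed by episodic feedback.

    Minimizing the returned value lowers entropy after a successful episode
    and raises it after a failed one (assumed sign convention).
    """
    entropies = [
        entropy(mixture_distribution(p, a, lam))
        for p, a in zip(step_probs, selected_actions)
    ]
    mean_entropy = sum(entropies) / len(entropies)
    return mean_entropy if success else -mean_entropy


# Toy episode: two steps, three candidate actions per step.
step_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
selected_actions = [0, 1]
loss_success = mixture_entropy_loss(step_probs, selected_actions, success=True)
loss_failure = mixture_entropy_loss(step_probs, selected_actions, success=False)
```

In an actual test-time adaptation loop, a quantity like this would be computed on differentiable policy outputs and back-propagated; the self-active learning component described in the abstract is not covered by this sketch.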
