Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation
Junyu Gao, Xuan Yao, Changsheng Xu
TL;DR
This work tackles online Vision-and-Language Navigation by introducing Fast-Slow Test-Time Adaptation (FSTTA), a two-phase approach that jointly analyzes gradients and parameter trajectories to achieve rapid adaptation while maintaining long-term stability. The fast phase uses gradient decomposition-accumulation via SVD to derive concordant update directions, with dynamic learning-rate scaling guided by gradient variance. The slow phase reuses historical parameter states to perform a parameter-trajectory decomposition-accumulation, yielding a stable optimization path. Across four VLN benchmarks, FSTTA yields consistent improvements over baselines and several TTA methods, demonstrating practical online adaptation with controlled forgetting and improved navigation performance.
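As a rough illustration of the fast phase's gradient decomposition-accumulation, the NumPy sketch below combines a few recent gradients into a single update direction using SVD. The function name `concordant_direction`, the centering step, the sign alignment against the mean gradient, and the singular-value weighting are all illustrative assumptions rather than the authors' implementation. The released code remains the reference for the actual procedure. The sketch does not cover learning-rate scaling or the slow phase.

```python
import numpy as np

def concordant_direction(grads):
    """Combine M recent gradients (each flattened to D dims) into one direction.

    Hypothetical helper: the weighting and sign rules below are assumptions
    made for illustration, not the formulas used in FSTTA.
    """
    grads = np.asarray(grads, dtype=float)   # shape (M, D)
    mean_grad = grads.mean(axis=0)           # reference direction (assumed)
    centered = grads - mean_grad             # remove the shared component

    # Rows of vt are orthonormal directions along which the gradients vary.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)

    total = s.sum()
    if total < 1e-12:
        # All gradients agree, so there is nothing to decompose.
        return mean_grad

    # Flip each component so it does not oppose the mean gradient (assumed rule).
    signs = np.sign(vt @ mean_grad)
    signs[signs == 0] = 1.0

    # Accumulate components, weighted by normalized singular values (assumed rule).
    weights = s / total
    return (weights * signs) @ vt

rng = np.random.default_rng(0)
recent_grads = rng.normal(size=(4, 6)) + 1.0   # 4 toy gradients of dimension 6
direction = concordant_direction(recent_grads)
print(direction.shape)
```

Because every component is sign-aligned with the mean gradient before it is accumulated, the resulting direction never points against the average of the recent gradients. That is one simple way to read "concordant" here.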
Abstract
The ability to accurately comprehend natural language instructions and navigate to the target location is essential for an embodied agent. Such agents are typically required to execute user instructions in an online manner, which leads us to explore the use of unlabeled test samples for effective online model adaptation. Online Vision-and-Language Navigation (VLN), however, involves both inter-sample online instruction execution and intra-sample multi-step action decisions. Under these conditions, frequent updates can cause drastic changes in model parameters, while occasional updates can leave the model ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online VLN that performs joint decomposition-accumulation analysis of both gradients and parameters in a unified framework. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks. Code is available at https://github.com/Feliciaxyao/ICML2024-FSTTA.
