Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation

Junyu Gao; Xuan Yao; Changsheng Xu

TL;DR

This work tackles online Vision-and-Language Navigation by introducing Fast-Slow Test-Time Adaptation (FSTTA), a two-phase approach that jointly analyzes gradients and parameter trajectories to achieve rapid adaptation while maintaining long-term stability. The fast phase uses gradient decomposition-accumulation via SVD to derive concordant update directions, with dynamic learning-rate scaling guided by gradient variance. The slow phase reuses historical parameter states to perform a parameter-trajectory decomposition-accumulation, yielding a stable optimization path. Across four VLN benchmarks, FSTTA yields consistent improvements over baselines and several TTA methods, demonstrating practical online adaptation with controlled forgetting and improved navigation performance.
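The fast phase summarized above can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' formulation: the function names `concordant_gradient` and `scaled_lr`, the energy-share weighting of singular directions, and the `1 / (1 + variance)` scaling rule with a `floor` are all hypothetical choices made here for clarity. Only the general idea (SVD over recent per-step gradients, plus a learning rate that shrinks as gradient variance grows) comes from the summary.

```python
import numpy as np


def concordant_gradient(grads):
    """Combine M recent per-step gradients (each of length D) into one direction.

    Assumption: directions are weighted by their share of squared singular
    values, so components that most gradients agree on dominate the update.
    """
    G = np.asarray(grads, dtype=float)               # shape (M, D)
    _, s, vt = np.linalg.svd(G, full_matrices=False)  # rows of vt: directions
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:                                  # all gradients are zero
        return np.zeros(G.shape[1])
    weights = energy / total
    mean_grad = G.mean(axis=0)
    coeffs = vt @ mean_grad                           # mean gradient per direction
    # Sign-invariant: flipping a row of vt flips its coefficient as well.
    return (weights * coeffs) @ vt


def scaled_lr(grads, base_lr, floor=0.5):
    """Shrink the learning rate when recent gradients disagree (high variance).

    Assumption: the 1 / (1 + variance) rule and the floor are illustrative.
    """
    G = np.asarray(grads, dtype=float)
    variance = G.var(axis=0).mean()                   # mean per-coordinate variance
    scale = max(floor, 1.0 / (1.0 + variance))
    return base_lr * scale


if __name__ == "__main__":
    recent = [[1.0, 1.0], [1.0, -1.0]]                # gradients that partly conflict
    step = scaled_lr(recent, base_lr=0.1) * concordant_gradient(recent)
    print(step)
```

In this sketch, the component on which the two gradients agree survives, while the conflicting component cancels out. A real implementation would apply this to flattened gradients of the adapted parameters only.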

Abstract

The ability to accurately comprehend natural language instructions and navigate to the target location is essential for an embodied agent. Such agents are typically required to execute user instructions in an online manner, leading us to explore the use of unlabeled test samples for effective online model adaptation. However, for online Vision-and-Language Navigation (VLN), due to the intrinsic nature of inter-sample online instruction execution and intra-sample multi-step action decision, frequent updates can result in drastic changes in model parameters, while occasional updates can make the model ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online VLN by performing joint decomposition-accumulation analysis for both gradients and parameters in a unified framework. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks. Code is available at https://github.com/Feliciaxyao/ICML2024-FSTTA.

Paper Structure (10 sections, 11 equations, 3 figures, 8 tables)

Introduction
Related Work
Our Approach
Fast Update via Gradient Analysis
Slow Update via Parameter Analysis
Experimental Results
Comparison with Different TTA Strategies
Extensive Analysis of FSTTA
Comparison with State-of-the-art VLN Models
Conclusions

Figures (3)

Figure 1: (a) Illustration of online VLN. (b) Comparison of TTA strategies on the REVERIE (Qi et al., 2020) validation unseen set using the SPL and SR metrics. 'DUET' (Chen et al., 2022) is the base model; 'Frequent Update' means updating at certain intervals within each sample; 'Stable Update' means initializing from the original base model for each sample and using its best intra-sample update interval, INT=1. All strategies use TENT (Wang et al., 2020) for model updates. The results show that neither overly fast nor overly slow TTA achieves significant improvements.
Figure 2: Overall framework of the proposed Fast-Slow Test-Time Adaptation (FSTTA) for online VLN. In the fast update phase, taking 'Sample i' as an example, the model periodically analyzes the gradients ($\{\bm g\}$) generated during the recent multi-step navigation and performs a gradient decomposition-accumulation analysis to pinpoint a concordant direction for the model update. After a certain number of fast updates, the historical model parameters ($\{\bm \Theta\}$) are recorded. In the slow update phase, the model is reverted to its historical state and a parameter decomposition-accumulation analysis is conducted to learn an optimization path for direct parameter modulation. 'F' and 'S' on the robots denote the model parameters after fast and slow updates, respectively. '$\text{F}_1$' indicates the first fast update within a test sample.
Figure 3: Representative visual results on the REVERIE validation unseen set. Yellow points denote start locations, while the directed lines with red and green points depict the predicted trajectories with target and incorrect endpoints, respectively. With FSTTA, the base agent (DUET) demonstrates enhanced exploration capabilities, moving towards the correct direction and succeeding based on the object context and scene layout.
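The slow update phase described in the Figure 2 caption can be sketched in the same spirit. This is a minimal illustration under assumptions, not the paper's exact procedure: the function name `slow_update`, the choice of the oldest snapshot as the state to revert to, the energy-share weighting, and the `step` parameter are all hypothetical. Only the overall idea (decompose the trajectory of recorded parameter states, then move the reverted model along the dominant path) comes from the source.

```python
import numpy as np


def slow_update(history, step=1.0):
    """Derive new parameters from recorded snapshots (oldest first).

    history: sequence of N+1 flattened parameter vectors of length D.
    Assumption: the model reverts to the oldest snapshot and moves along
    the SVD directions of the later displacements, weighted by energy share.
    """
    H = np.asarray(history, dtype=float)
    anchor = H[0]                                  # historical state to revert to
    deltas = H[1:] - anchor                        # displacement of each later state
    _, s, vt = np.linalg.svd(deltas, full_matrices=False)
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:                               # parameters never moved
        return anchor.copy()
    weights = energy / total
    mean_delta = deltas.mean(axis=0)
    path = (weights * (vt @ mean_delta)) @ vt      # accumulated optimization path
    return anchor + step * path


if __name__ == "__main__":
    snapshots = [[0.0, 0.0], [1.0, 0.2], [2.0, -0.2]]
    print(slow_update(snapshots))
```

Because the dominant direction carries most of the weight, small back-and-forth oscillations across snapshots are damped, while consistent drift is retained.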
