SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Gengze Zhou, Yicong Hong, Zun Wang, Chongyang Zhao, Mohit Bansal, Qi Wu
TL;DR
This work addresses the fragmentation of language-guided visual navigation into task-specific problems by proposing State-Adaptive Mixture of Experts (SAME), a sparse MoE that routes decisions based on the agent’s current state and multimodal observations. By unifying seven navigation tasks under a single framework and pretraining with VLN data, SAME achieves state-of-the-art results, or results comparable to those of specialized models, in both discrete and continuous settings. The approach demonstrates strong cross-task generalization: VLN pretraining notably boosts performance on several tasks, and ablations show the importance of the training strategy and of balanced MoE routing. Overall, SAME advances embodied AI by enabling versatile, instruction-grounded navigation across diverse environments and instruction granularities.
Abstract
Research on instruction-guided visual navigation can be broadly divided into high-level category-specific search and low-level language-guided navigation, depending on the granularity of the language instruction: the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework. We investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation, and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent that addresses seven navigation tasks simultaneously and either outperforms task-specific agents or achieves highly comparable performance.
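
To make the idea of state-adaptive sparse routing concrete, the following is a minimal NumPy sketch and not the SAME implementation. It assumes the agent's current state is summarized as a single fused feature vector (instruction plus observation), that the router is a single linear layer, and that each expert is a small nonlinear map. The class name `StateAdaptiveMoE`, its parameters, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class StateAdaptiveMoE:
    """Hypothetical sparse MoE layer whose router is conditioned on a state feature."""

    def __init__(self, state_dim, hidden_dim, num_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router: one linear map from the state feature to one logit per expert.
        self.router = rng.normal(0.0, 0.02, (state_dim, num_experts))
        # Experts: one weight matrix each (an assumption; real experts are usually FFNs).
        self.experts = rng.normal(0.0, 0.02, (num_experts, hidden_dim, hidden_dim))

    def route(self, state):
        """Select the top-k experts for this state and renormalize their weights."""
        logits = state @ self.router              # shape: (num_experts,)
        top = np.argsort(logits)[-self.top_k:]    # indices of the k largest logits
        weights = softmax(logits[top])            # weights over the selected experts
        return top, weights

    def __call__(self, state, hidden):
        """Apply the selected experts to all hidden tokens and mix their outputs."""
        top, weights = self.route(state)
        out = np.zeros_like(hidden)
        for idx, w in zip(top, weights):
            out += w * np.tanh(hidden @ self.experts[idx])
        return out


# Usage: one routing decision per navigation step, shared by all tokens at that step.
rng = np.random.default_rng(1)
moe = StateAdaptiveMoE(state_dim=8, hidden_dim=16)
state = rng.normal(size=8)          # assumed fused instruction/observation feature
hidden = rng.normal(size=(5, 16))   # five hidden tokens at the current step
output = moe(state, hidden)         # same shape as `hidden`
```

In this sketch the expert choice is made once per navigation step from the state feature rather than independently for every token, which is one plausible reading of "state-adaptive". Only `top_k` experts are evaluated per step, so compute stays sparse as the number of experts grows.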
