LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation
Nitesh Subedi, Adam Haroon, Samuel Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar
TL;DR
LCLA reframes embodied navigation as a representation alignment problem: a privileged expert policy is trained with full state information, its latent interface is frozen, and a lightweight adapter learns to map vision-language inputs into that latent space. The approach decouples perception from control, so the frozen controller can be reused across sensing modalities and environments while maintaining robust in- and out-of-distribution performance. Controlled ablations show that explicit language conditioning and latent alignment are jointly necessary for strong results; end-to-end imitation and latent alignment alone both underperform. Empirically, LCLA achieves high in-distribution success with low inference latency and exhibits strong zero-shot generalization to unseen environments, lighting, and viewpoints, illustrating the practical value of task-centric latent interfaces in vision-language navigation.
Abstract
We propose LCLA (Language-Conditioned Latent Alignment), a framework for vision-language navigation that learns modular perception-action interfaces by aligning sensory observations to a latent representation of an expert policy. The expert is first trained with privileged state information, inducing a latent space sufficient for control, after which its latent interface and action head are frozen. A lightweight adapter is then trained to map raw vision-language observations, via a frozen vision-language model, into the expert's latent space, reducing visuomotor learning to supervised latent alignment instead of end-to-end policy optimization. This decoupling enforces a stable contract between perception and control, enabling expert behavior to be reused across sensing modalities and environmental variations. We instantiate LCLA and evaluate it on a vision-language indoor navigation task, where aligned latent spaces yield strong in-distribution performance and robust zero-shot generalization to unseen environments, lighting conditions, and viewpoints while remaining lightweight at inference time.
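
The pipeline described in the abstract (frozen expert, frozen feature extractor, trainable adapter fit by supervised latent alignment) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the dimensions, the random linear stand-ins for the expert encoder, action head, and vision-language features, the linear adapter, the mean-squared-error objective, and the closed-form least-squares fit (used here only to keep the example short; a real adapter would more plausibly be a small network trained by gradient descent).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this sketch.
STATE_DIM, FEAT_DIM, LATENT_DIM, ACTION_DIM = 6, 16, 4, 2

# Stand-ins for the frozen expert. In the paper these come from training
# with privileged state; here they are fixed random weights.
W_ENC = rng.normal(size=(STATE_DIM, LATENT_DIM))   # state -> latent
W_ACT = rng.normal(size=(LATENT_DIM, ACTION_DIM))  # latent -> action


def expert_latent(states):
    """Frozen expert encoder: privileged state -> latent."""
    return np.tanh(states @ W_ENC)


def action_head(latents):
    """Frozen expert action head: latent -> action."""
    return latents @ W_ACT


# Stand-in for frozen vision-language model features. A real system would
# compute these from images and an instruction; here they are a fixed
# random projection of the state (assumption).
W_OBS = rng.normal(size=(STATE_DIM, FEAT_DIM))


def vlm_features(states):
    return states @ W_OBS


def alignment_loss(adapter, feats, targets):
    """Mean squared error between adapter outputs and expert latents."""
    return float(np.mean((feats @ adapter - targets) ** 2))


# Supervised latent alignment: only the adapter is fit. The expert and the
# feature extractor are never updated.
states = rng.normal(size=(256, STATE_DIM))
feats = vlm_features(states)
targets = expert_latent(states)

untrained_adapter = np.zeros((FEAT_DIM, LATENT_DIM))
adapter, *_ = np.linalg.lstsq(feats, targets, rcond=None)

loss_before = alignment_loss(untrained_adapter, feats, targets)
loss_after = alignment_loss(adapter, feats, targets)

# At inference time the privileged state is not needed: observations flow
# through the adapter into the frozen action head.
actions = action_head(feats @ adapter)
```

The point of the sketch is the training contract: the expert's latent space is the regression target, the action head is reused unchanged, and the adapter is the only component that would need refitting when the sensing modality changes.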
