LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation

Nitesh Subedi, Adam Haroon, Samuel Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar

TL;DR

LCLA reframes embodied navigation as a representation alignment problem by training a privileged expert policy with full state information, freezing its latent interface, and learning a lightweight adapter that maps vision–language inputs into that latent space. The approach decouples perception from control, enabling modular reuse of the frozen controller across sensing modalities and environments while maintaining robust in- and out-of-distribution performance. Controlled ablations show that explicit language conditioning and latent alignment are jointly necessary for strong results: end-to-end imitation and latent alignment alone both underperform. Empirically, LCLA achieves high in-distribution success with low inference latency and exhibits strong zero-shot generalization to unseen environments, lighting, and viewpoints, illustrating the practical value of task-centric latent interfaces in vision–language navigation.

Abstract

We propose LCLA (Language-Conditioned Latent Alignment), a framework for vision-language navigation that learns modular perception-action interfaces by aligning sensory observations to a latent representation of an expert policy. The expert is first trained with privileged state information, inducing a latent space sufficient for control, after which its latent interface and action head are frozen. A lightweight adapter is then trained to map raw visual-language observations, via a frozen vision-language model, into the expert's latent space, reducing the problem of visuomotor learning to supervised latent alignment rather than end-to-end policy optimization. This decoupling enforces a stable contract between perception and control, enabling expert behavior to be reused across sensing modalities and environmental variations. We instantiate LCLA and evaluate it on a vision-language indoor navigation task, where aligned latent spaces yield strong in-distribution performance and robust zero-shot generalization to unseen environments, lighting conditions, and viewpoints while remaining lightweight at inference time.
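
The two-stage recipe described in the abstract (freeze the expert, then fit an adapter by supervised latent alignment) can be illustrated with a toy NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the linear stand-ins for the expert encoder and action head, the random projection that plays the role of frozen VLM features, the dimensions, and the choice of a mean-squared-error objective solved in closed form by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: privileged state, latent, action, VLM feature dims, sample count.
S, D, A, F, N = 6, 4, 2, 16, 512

# Stage 1 (assumed already done): a frozen expert, here two fixed linear maps.
W_enc = rng.normal(size=(S, D))   # privileged state -> latent z
W_act = rng.normal(size=(D, A))   # latent z -> action (frozen action head)

def expert_latent(state):
    return state @ W_enc

def action_head(z):
    return z @ W_act

# Rollout data: privileged states paired with observation features.
# The features are a toy stand-in for frozen VLM embeddings of image and text.
states = rng.normal(size=(N, S))
W_obs = rng.normal(size=(S, F))
feats = states @ W_obs + 0.01 * rng.normal(size=(N, F))

# Stage 2: supervised latent alignment. The targets are the expert's latents,
# and the adapter (a single linear layer here) minimizes
# ||feats @ W_adapter - z_target||^2.
z_target = expert_latent(states)
W_adapter, *_ = np.linalg.lstsq(feats, z_target, rcond=None)

# Deployment: observation features -> adapter -> frozen action head.
def act_from_observation(f):
    return action_head(f @ W_adapter)

# Compare against the privileged expert on held-out states.
test_states = rng.normal(size=(64, S))
test_feats = test_states @ W_obs
err = np.abs(
    act_from_observation(test_feats) - action_head(expert_latent(test_states))
).max()
print("max action error vs. expert:", err)
```

Only `W_adapter` is fit in this sketch; the expert encoder and action head are never updated, which is the sense in which the latent space acts as a fixed contract between perception and control.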

Paper Structure (81 sections, 38 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of latent representation alignment with a frozen privileged policy. An expert policy $\pi$ is first trained using privileged state observations (e.g., object poses, robot pose, and environment geometry) and induces a task-relevant latent representation $\mathbf{z} \in \mathbb{R}^D$ sufficient for control. Rollouts collected from this policy provide supervision for learning a lightweight adapter on top of a pre-trained vision–language model (VLM). The adapter aligns image and text embeddings to the expert’s latent space, while the expert policy and action head remain frozen. At deployment, actions are produced by mapping visual and language inputs through the VLM and adapter into the expert-defined latent space and reusing the frozen action head.
  • Figure 2: Architecture of the Language Conditioned Latent Alignment Adapter (LCLAA). The model takes image patches and a text embedding as input. (1) Patches are first contextualized via self-attention. (2) A spatial bottleneck then uses text-conditioned importance scores to select relevant visual context (soft selection). (3) A query generation module combines the text embedding with learnable queries. (4) These queries attend to the selected visual context through stacked cross-attention blocks. (5) Finally, a gated fusion mechanism combines the processed queries with the original text residual to produce the aligned latent representation $Z$.
  • Figure 3: The left panels show two example indoor environments (Room A and Room B) with diverse furniture, objects, and layouts, illustrating the visual complexity encountered during training. The right panel summarizes the structured language prompt templates used for adapter training. Controlled linguistic variation and spatial relations encourage compositional understanding and support generalization to OOD objects, layouts, and relational configurations at evaluation time.
  • Figure 4: Illumination intensity perturbations in the out-of-distribution Room B environment. Egocentric RGB observations are shown under very high, high, default, low, and very low lighting conditions. These controlled changes in global illumination significantly alter scene appearance, contrast, and shadowing, and are used to evaluate robustness to lighting variation.
  • Figure 5: Camera height perturbations in the out-of-distribution Room B environment. The agent’s egocentric RGB observations are shown for a lowered camera ($-0.2\,\mathrm{m}$), the default camera height, and a raised camera ($+0.2\,\mathrm{m}$). These perturbations induce significant changes in visible floor area, object scale, and perspective, serving as a controlled test of viewpoint robustness.
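
The five steps listed in the Figure 2 caption can be traced in a minimal NumPy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the function name `lclaa_forward`, the parameter names, single-head attention without learned projections, the sigmoid form of the importance scores, mean pooling over queries, and the exact gate are all hypothetical choices made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def lclaa_forward(patches, text, params, n_blocks=2):
    """patches: (P, d) image patch embeddings; text: (d,) text embedding."""
    d = text.shape[0]
    # (1) Contextualize patches with self-attention (residual connection assumed).
    ctx = patches + attention(patches, patches, patches)
    # (2) Spatial bottleneck: text-conditioned importance scores in (0, 1)
    #     softly down-weight patches that are irrelevant to the instruction.
    importance = sigmoid(ctx @ text / np.sqrt(d))        # (P,)
    selected = ctx * importance[:, None]                  # (P, d)
    # (3) Query generation: learnable queries combined with the text embedding.
    queries = params["queries"] + text[None, :]           # (Q, d)
    # (4) Stacked cross-attention from queries to the selected visual context.
    for _ in range(n_blocks):
        queries = queries + attention(queries, selected, selected)
    # (5) Gated fusion of the pooled queries with the original text residual.
    pooled = queries.mean(axis=0)                          # (d,)
    gate = sigmoid(np.concatenate([pooled, text]) @ params["W_gate"])
    fused = gate * pooled + (1.0 - gate) * text
    # Project to the expert's latent dimension D.
    return fused @ params["W_out"]

rng = np.random.default_rng(0)
P, d, Q, D = 16, 8, 4, 5   # hypothetical sizes
params = {
    "queries": rng.normal(size=(Q, d)),
    "W_gate": rng.normal(size=(2 * d, d)),
    "W_out": rng.normal(size=(d, D)),
}
z = lclaa_forward(rng.normal(size=(P, d)), rng.normal(size=d), params)
print(z.shape)
```

In a trained system the output `z` would be regressed onto the frozen expert's latent and then passed to the frozen action head; here the weights are random, so only the shapes and data flow are meaningful.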