Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Akhil Perincherry, Jacob Krantz, Stefan Lee

TL;DR

This work investigates whether diffusion-generated visual imaginations of instruction landmarks can improve VLN agents. By segmenting instructions into sub-goals, filtering for informative nouns, and generating corresponding imagery, the authors train a model-agnostic imagination encoder and apply a cosine-alignment loss to ground imaginations to language. Integrated into HAMT and DUET, the approach yields consistent, though modest, improvements in SR and SPL on R2R and REVERIE, with sequential imaginations and proper alignment providing the strongest gains. The results suggest imaginations can reinforce visual grounding in VLN and point to future work in Sim2Real transfer and lifelong grounding of visual concepts.
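The cosine-alignment idea mentioned above can be illustrated with a minimal sketch. It assumes the loss takes the common form of one minus the cosine similarity between each imagination embedding and the embedding of its paired sub-instruction, averaged over pairs; the paper's exact formulation may differ, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_alignment_loss(imagination_embs, text_embs):
    """Mean of (1 - cosine similarity) over paired imagination/text embeddings.

    Assumption: pair k of each list refers to the same sub-instruction.
    """
    if len(imagination_embs) != len(text_embs):
        raise ValueError("each imagination needs exactly one paired text embedding")
    if not imagination_embs:
        return 0.0
    losses = [1.0 - cosine_similarity(h, t)
              for h, t in zip(imagination_embs, text_embs)]
    return sum(losses) / len(losses)


# Toy example: the first pair is perfectly aligned, the second is orthogonal.
h = [[1.0, 0.0], [0.0, 1.0]]
t = [[1.0, 0.0], [1.0, 0.0]]
loss = cosine_alignment_loss(h, t)
print(loss)
```

Under this form, the loss is 0 when every imagination embedding points in the same direction as its text embedding and grows as the pairs diverge, which is what "grounding imaginations to language" would require of such an auxiliary term.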

Abstract

Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations, or imaginations, we apply a text-to-image diffusion model to landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss is added to explicitly encourage relating them to their corresponding referring expressions. Our findings reveal an increase of around 1 point in success rate (SR) and of up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at https://www.akhilperincherry.com/VLN-Imagine-website/.
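For readers unfamiliar with the two metrics, the sketch below computes SR and SPL using the standard definition of SPL, in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The `Episode` type and the toy values are illustrative assumptions, not data from the paper.

```python
from typing import List, NamedTuple


class Episode(NamedTuple):
    success: bool           # did the agent stop close enough to the goal?
    shortest_path: float    # length of the shortest start-to-goal path
    agent_path: float       # length of the path the agent actually took


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by (normalized inverse) path length, averaged over episodes."""
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        if e.success and denom > 0.0:
            total += e.shortest_path / denom
    return total / len(episodes)


# Toy data: one optimal success, one success with a 2x detour, one failure.
episodes = [
    Episode(True, 10.0, 10.0),
    Episode(True, 10.0, 20.0),
    Episode(False, 10.0, 5.0),
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
print(sr_value, spl_value)
```

Both metrics lie in [0, 1]. Assuming the usual convention of reporting them as percentages, a gain of "1 point" in SR corresponds to one additional successful episode per hundred.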

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Illustration of visual imaginations. (Top) A natural language instruction specifying sub-goals pool table, kitchen, and bedroom. (Bottom) Visual imaginations of landmarks pool table, kitchen and bedroom referenced by the sub-goals in the instruction. In our work, we study if these visual imaginations generated using text-to-image models can improve performance in VLN.
  • Figure 2: An overview of our approach. (Left) Imaginations generated from the valid sub-instructions of an instruction, as determined by our filtering scheme, are first passed to a pre-trained ViT to obtain feature vectors. A type embedding $t_{Im}$ for the imagination modality is then added to the features, which are encoded using a 3-layer MLP to obtain imagination embeddings $h_i$. (Right) To integrate the imagination modality into a VLN agent, the imagination embeddings $h_i$ are concatenated with the instruction embeddings $t_i$ produced by a text encoder $f_T(W)$. The concatenated imagination-text embeddings are passed to the VLN agent's cross-modal encoder $f_X$, along with visual embeddings, to predict a distribution over the agent's action space.
  • Figure 3: Example instruction segmentation, filtering, and image generation. Instructions are segmented into sub-instructions using FG-R2R [fgr2r] and then filtered to remove phrases referring to uninformative nouns (e.g., "right"). We produce visual imaginations using SDXL [podell2023sdxl] for the remaining sub-instructions.
  • Figure 4: Qualitative examples showing imaginations as pivots between language and observation images. The first column contains a sub-instruction from a random R2R instruction, and the second column contains the imagination generated from that sub-instruction. The third and fourth columns show the most highly attended language tokens and observation images from an attention head in HAMT's cross-modal transformer, at the time step when the associated observations first become visible. In the second example (row 2), the sub-instruction references "unicycle", which the imagination captures along with the neighboring nouns "easel" and "door". We observe that in a head where the language tokens most attended for the imagination query are references to nouns associated with the sub-instruction, the observations most attended for that query are images of the same concept ("unicycle"). In this example, the imagination of a unicycle is used to associate the language tokens for "unicycle" with observations of a unicycle, hinting at the utility of imaginations in navigation.
  • Figure 5: Integration of imagination embeddings into HAMT and DUET. The imagination embeddings are concatenated with the language embeddings before being passed through the cross-modal encoders.
  • ...and 2 more figures
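
The data flow described in the Figure 2 and Figure 5 captions can be sketched as follows. This is a toy, dependency-free illustration rather than the authors' implementation: the random weights, the ReLU activations, the dimensions, and the choice to append imagination embeddings after the text embeddings are all assumptions, and the class and function names are hypothetical. A real system would use a tensor library and a pre-trained ViT to produce the input features.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of output rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero biases for one linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class ImaginationEncoder:
    """Type embedding t_Im added to ViT features, followed by a 3-layer MLP."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # t_Im: one learned vector shared by every imagination (random here).
        self.type_embedding = [rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
        self.layers = [
            init_linear(rng, feat_dim, hidden_dim),
            init_linear(rng, hidden_dim, hidden_dim),
            init_linear(rng, hidden_dim, hidden_dim),
        ]

    def encode(self, vit_features):
        """Map one ViT feature vector per imagination to an embedding h_i."""
        embeddings = []
        for feat in vit_features:
            x = [f + t for f, t in zip(feat, self.type_embedding)]
            for idx, (weight, bias) in enumerate(self.layers):
                x = linear(x, weight, bias)
                if idx < len(self.layers) - 1:  # assumed: ReLU between layers only
                    x = relu(x)
            embeddings.append(x)
        return embeddings


def concat_with_text(text_embeddings, imagination_embeddings):
    """Concatenate along the sequence axis (assumed order: text first)."""
    return list(text_embeddings) + list(imagination_embeddings)


encoder = ImaginationEncoder(feat_dim=8, hidden_dim=4)
vit_features = [[0.1 * i] * 8 for i in range(3)]   # stand-ins for 3 imaginations
h = encoder.encode(vit_features)
t = [[0.0] * 4 for _ in range(5)]                  # stand-ins for 5 text tokens
fused = concat_with_text(t, h)
print(len(fused), len(fused[0]))
```

The fused sequence is what a cross-modal encoder such as $f_X$ would consume together with the visual embeddings. Because the imagination branch only changes the sequence fed to that encoder, it can be attached to agents like HAMT and DUET without modifying their core architecture.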