CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation

Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse Thomason, Gaurav S. Sukhatme

TL;DR

This work tackles Vision-and-Language Navigation (VLN) in unseen, diverse indoor environments by proposing a fully zero-shot approach that uses CLIP to ground natural referring expressions without finetuning. It introduces a three-stage pipeline—instruction decomposition, CLIP-based grounding, and zero-shot sequential navigation—with two variants: CLIP-Nav and Seq CLIP-Nav (which incorporates backtracking via a Sequence Grounding Score). The methods achieve stronger generalization than supervised baselines on REVERIE, notably improving SPL on unseen data and reducing the Relative Change in Success (RCS). The results demonstrate that zero-shot CLIP grounding can drive robust, sequential navigation across new environments, suggesting practical potential for adaptable embodied agents.

Abstract

Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity, while also following arbitrary language instructions. Recently, Vision-Language models like CLIP have shown great performance on the task of zero-shot object recognition. In this work, we ask whether these models are also capable of zero-shot language grounding. In particular, we use CLIP to tackle the novel problem of zero-shot VLN using natural language referring expressions that describe target objects, in contrast to past work that used simple language templates describing object classes. We examine CLIP's capability in making sequential navigational decisions without any dataset-specific finetuning, and study how it influences the path that an agent takes. Our results on the coarse-grained instruction following task of REVERIE demonstrate the navigational capability of CLIP, surpassing the supervised baseline in terms of both success rate (SR) and success weighted by path length (SPL). More importantly, we quantitatively show, via Relative Change in Success (RCS), that our CLIP-based zero-shot approach generalizes better than SOTA fully supervised approaches, performing more consistently across environments.

Paper Structure

This paper contains 10 sections, 1 equation, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Comparing Model Performance on REVERIE Seen and Unseen Splits: Observe the significant drop in performance on the Unseen Val and Unseen Test sets (blue) compared with the Val Seen set (green), across all approaches, as measured by both Success Rate (SR) (Left) and Success Weighted by Path Length (SPL) (Right). Also observe the poor Unseen Test set performance compared to the human baseline.
  • Figure 2: Household images generated from text using a latent diffusion model [latdiff]. Observe the variance in layout, positioning and lighting. Each home is visually unique, and we hypothesize that this causes people to use unique, environment-specific language when giving instructions ("orange-striped bathroom", for instance). This hypothesis is substantiated by our inspection of REVERIE, and motivates us to treat VLN as a fully zero-shot problem.
  • Figure 3: We look at CLIP's ability to make sequential navigational decisions. Here, the instruction "Go to the kitchen" suggests that the agent needs to leave the room. However, to make this decision, it needs to ground the instruction within the panorama and choose a view with the door leading outwards. Notice that there are no clear visual entities (e.g., spoons or sinks) to suggest that the chosen image (in red) is a "kitchen". The decision is based on pretrained CLIP's structural prior of the household in picking a view that might lead to the kitchen.
  • Figure 4: CLIP Grounding - We ground the Navigational Component (NC) on all the split images to obtain Keyphrase Grounding Scores (KGS). The "CLIP-chosen image" (highlighted in red) is the one with the highest KGS, and it drives our navigation algorithms. We simultaneously ground the Activity Component (AC), and use its grounding score to determine whether the agent has reached the target location. This is our "Stop Condition".
  • Figure 5: CLIP-Nav - We present a novel approach for zero-shot VLN that uses CLIP to make sequential navigational decisions. At each timestep, a CLIP-Chosen Image is determined by grounding the current NC to each of the panoramic splits. The chosen image represents the direction our model has chosen for zero-shot navigation. In this case, it indicates that the bedroom is potentially somewhere in the chosen direction. The AC grounding score gives us a stopping threshold for when our agent believes it has reached the target. CLIP-Nav runs iteratively until this threshold is reached.
  • ...and 3 more figures
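
The grounding step described in the captions for Figures 4 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the NC, the AC, and each panoramic split have already been embedded (for example by a CLIP text and image encoder), that a grounding score is the cosine similarity between a text embedding and an image embedding, and that the stop condition compares the best AC score over all splits against a threshold. The function names (`choose_view`, `should_stop`) and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_view(nc_embedding: Vector,
                view_embeddings: Sequence[Vector]) -> Tuple[int, List[float]]:
    """Score every panoramic split against the NC and pick the best one.

    Returns the index of the highest-scoring split (the "CLIP-chosen image")
    together with the per-split scores.
    """
    scores = [cosine_similarity(nc_embedding, v) for v in view_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


def should_stop(ac_embedding: Vector,
                view_embeddings: Sequence[Vector],
                threshold: float) -> bool:
    """Assumed stop condition: the best AC score reaches the threshold."""
    best_ac_score = max(cosine_similarity(ac_embedding, v)
                        for v in view_embeddings)
    return best_ac_score >= threshold


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real text and image embeddings.
    nc = [1.0, 0.0]
    ac = [0.0, 1.0]
    views = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

    index, scores = choose_view(nc, views)
    print("chosen split:", index)
    print("stop:", should_stop(ac, views, threshold=0.9))
```

In a full agent, these two calls would sit inside a loop: move toward the chosen split, re-embed the new panorama, and repeat until `should_stop` returns true. How the threshold is set and how the agent moves between viewpoints are not specified in this section, so they are left out of the sketch.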