Disentangling Foreground and Background for Vision-Language Navigation via Online Augmentation

Yunbo Xu, Xuesong Zhang, Jia Li, Zhenzhen Hu, Richang Hong

TL;DR

This work tackles generalization in vision-language navigation (VLN) by explicitly disentangling foreground and background information in visual observations and applying an online augmentation strategy. The proposed COFA framework combines semantic landmark-based foreground extraction, spatially disentangled masks, and CLIP features with a consensus-driven two-stage voting mechanism that selects the most informative feature at each viewpoint without changing the model architecture. Empirical results on R2R and REVERIE show state-of-the-art performance and improved generalization; ablations confirm that both foreground and background features contribute and that consensus voting outperforms stochastic baselines. The approach offers a practical, low-cost way to improve VLN systems and can be readily extended to other VLN methods.

Abstract

Vision-language navigation (VLN) agents are tasked with following language instructions to navigate unseen environments. While augmenting multifaceted visual representations has propelled advances in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background carries spatial connectivity information. Inspired by this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) that uses alternative foreground and background features to improve navigation generalization. Specifically, we first leverage semantically enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of the baseline and attains state-of-the-art performance.
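
The idea of consolidating two-stage voting results into a single feature preference per viewpoint can be illustrated with a small sketch. The abstract does not specify who the voters are, how they score candidates, or how ties are resolved, so everything below is an assumption made for illustration only: the candidate names, the grouping of voters, the per-voter scores, and the fallback to the unaugmented feature on a tie are all hypothetical and are not taken from the COFA implementation.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical candidate feature types available at one viewpoint.
CANDIDATES = ("original", "foreground", "background")


def majority(votes: Sequence[str], tie_break: str = "original") -> str:
    """Return the most frequent vote in a non-empty sequence.

    Assumption: if the two leading candidates are tied, fall back to
    `tie_break` (here, the unaugmented feature).
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]


def two_stage_vote(groups: List[List[Dict[str, float]]]) -> str:
    """Pick one candidate feature for a viewpoint by two rounds of voting.

    Stage 1: each voter picks its highest-scoring candidate, and each
             group takes the majority of its voters' picks.
    Stage 2: the final choice is the majority over the group-level picks.
    """
    group_choices = []
    for group in groups:
        voter_choices = [max(CANDIDATES, key=voter.get) for voter in group]
        group_choices.append(majority(voter_choices))
    return majority(group_choices)


# Made-up scores for a single viewpoint: three groups of voters.
example_groups = [
    [
        {"original": 0.2, "foreground": 0.7, "background": 0.1},
        {"original": 0.1, "foreground": 0.6, "background": 0.3},
        {"original": 0.5, "foreground": 0.3, "background": 0.2},
    ],
    [
        {"original": 0.1, "foreground": 0.2, "background": 0.7},
        {"original": 0.2, "foreground": 0.1, "background": 0.7},
    ],
    [
        {"original": 0.3, "foreground": 0.6, "background": 0.1},
    ],
]

selected = two_stage_vote(example_groups)
print(selected)  # foreground
```

In this toy run the three groups pick foreground, background, and foreground, so the second stage selects the foreground feature. Because the selection only swaps which precomputed feature is fed to the agent at a viewpoint, a scheme of this kind leaves the navigation model's architecture untouched, which matches the claim made for COFA.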

Paper Structure

This paper contains 13 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The overview of the proposed COFA: a) we extract foreground and background features by identifying spatially disentangled regions through foreground landmark identification; b) online augmentation at the viewpoint level using two-stage voting for preferred augmentation features; c) the proposed online augmented features can be seamlessly integrated into a generic navigation pipeline.
  • Figure 2: Quantitative analysis of viewpoint-level feature preferences across different VLN datasets and splits.