Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang

TL;DR

This work tackles data scarcity in Vision-Language Navigation by introducing RAM, a rewriting-driven data augmentation framework that generates unseen observation-instruction pairs without extra simulators or web data. It combines Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting, powered by Vision-Language Models, Large Language Models, and Text-to-Image Generation Models, then trains with a mixing-then-focusing strategy and random observation cropping to diversify data while mitigating noise. RAM generalizes well across multiple VLN benchmarks, including transfer to continuous environments, and achieves competitive results with far less augmented data than prior large-scale approaches. The approach highlights the practical potential of foundation-model-driven data generation for embodied AI and suggests future directions in efficient fine-tuning and interactive learning for VLN data augmentation.

Abstract

Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous work primarily relies on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often requires extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between the original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, which enhances data distribution diversity while suppressing augmentation data noise during training. Experiments on both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.

Paper Structure

This paper contains 30 sections, 8 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Comparison of our RAM with typical VLN data augmentation approaches: (a) augmentation from original/additional simulators and (b) augmentation from web images or videos. Unlike these methods, which may be limited to specific simulator environments or require tedious data cleaning, our RAM (c) enables simulator-free and labor-saving data augmentation by rewriting human-annotated data to generate unseen observation-instruction pairs. The generated scene objects and the newly introduced objects mentioned in the rewritten instructions are denoted by blue boxes/fonts. New actional representations in the rewritten instructions are denoted by red underlined fonts. The mixing-then-focusing training strategy is omitted in this figure.
  • Figure 2: Overview of our Rewriting-driven AugMentation (RAM) paradigm. For Object-Enriched Observation Rewriting, we collect object-enriched rewritten scene descriptions based on VLMs and LLMs. We then feed the rewritten descriptions to T2IMs to synthesize new observations via an efficient panorama-to-view scheme. During Observation-Contrast Instruction Rewriting, we ask the LLMs to perform observation contrast by reasoning about the difference between the original and new observation descriptions to generate new instructions. We further introduce a mixing-then-focusing strategy with a random observation cropping scheme for combining our rewritten trajectory-instruction pairs with human-annotated data for training. Newly generated objects and actional representations in the rewritten instruction are denoted in blue and red underlined fonts, respectively.
  • Figure 3: Prompts for Object-Enriched Scene Description Rewriting and Observation-Contrast Instruction Rewriting.
  • Figure 4: Ablation results for the mixed training strategy on the R2R dataset. "1:1", "1:3", and "1:5" represent the data mixing ratio between the original human-annotated data and our rewritten data. "RdCrop(1:3)" means using our random observation cropping scheme with a data mixing ratio of 1:3.
  • Figure 5: Visualization examples of the rewritten object-enriched scene description, the generated panorama, the ground-truth sequence extracted from the generated panorama, and the rewritten instruction. Newly generated scene objects and modality-aligned objects in the instruction are denoted in red boxes and bold fonts, respectively.
  • ...and 4 more figures
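
The data mixing ratios named in the Figure 4 caption (for example, 1:3 between original human-annotated data and rewritten data) and the idea of random observation cropping can be illustrated with a minimal Python sketch. The text above does not specify how the authors sample or crop, so the function names `mix_dataset` and `random_crop_box`, the `min_scale` parameter, and the fallback to sampling with replacement are all assumptions made for illustration, not the RAM implementation.

```python
import random


def mix_dataset(original, rewritten, ratio=(1, 3), seed=0):
    """Combine original and rewritten samples at roughly ratio[0]:ratio[1].

    Hypothetical helper: every original sample is kept, and rewritten
    samples are drawn to match the requested ratio.
    """
    rng = random.Random(seed)
    n_orig, n_rew = ratio
    # Number of rewritten samples needed for the given number of originals.
    target = len(original) * n_rew // n_orig
    if target <= len(rewritten):
        picked = rng.sample(rewritten, target)  # without replacement
    else:
        # Assumption: fall back to sampling with replacement if too few exist.
        picked = [rng.choice(rewritten) for _ in range(target)]
    mixed = [("original", s) for s in original]
    mixed += [("rewritten", s) for s in picked]
    rng.shuffle(mixed)
    return mixed


def random_crop_box(width, height, min_scale=0.5, rng=None):
    """Return a random crop box (left, top, right, bottom) inside an image.

    Hypothetical stand-in for a random observation cropping step; the
    scale range is an assumed parameter.
    """
    rng = rng or random.Random()
    scale = rng.uniform(min_scale, 1.0)
    crop_w = max(1, int(width * scale))
    crop_h = max(1, int(height * scale))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h
```

With 10 original samples and a ratio of `(1, 3)`, `mix_dataset` returns 40 tagged samples, 10 original and 30 rewritten. The crop box could then be passed to an image library's crop routine when loading each observation.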