PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation

Sen Wang, Dongliang Zhou, Liang Xie, Chao Xu, Ye Yan, Erwei Yin

TL;DR

PanoGen++ tackles data scarcity in vision-and-language navigation (VLN) by adapting a pre-trained diffusion model to the VLN domain through parameter-efficient LoRA fine-tuning on a VLN-specific image-text corpus. It introduces two scene-generation modes, masked image inpainting and recursive image outpainting, to create diverse, coherent panoramic environments for both pre-training and fine-tuning VLN agents. Experiments on R2R, R4R, and CVDN report state-of-the-art results: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter gain in goal progress on the CVDN validation unseen set, indicating better generalization to unseen environments. The work shows the practical value of domain-specific synthetic data and efficient diffusion-model adaptation for embodied AI tasks.

Abstract

Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents' learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
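
The low-rank adaptation mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization scale are assumptions chosen for illustration. The idea shown is the general one behind LoRA: the pre-trained weight `W` stays frozen, and only two small matrices `A` and `B` are trained, so the effective weight is `W + (alpha / rank) * B @ A`.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Illustrative linear layer with a frozen weight and a low-rank update.

    Hypothetical helper for this sketch only; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, shape d_out x d_in
        self.scaling = alpha / rank   # common LoRA scaling factor
        # A (rank x d_in) gets a small random init; B (d_out x rank) starts at zero,
        # so the adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scaling * (B @ A) without modifying W."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector of length d_in."""
        return [sum(w * v for w, v in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Count only the entries of A and B; W is excluded because it is frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 4 x 6 frozen weight adapted with rank 2.
frozen = [[float(i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(frozen, rank=2)
output = layer.forward([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

In this toy case the adapter trains `rank * (d_in + d_out)` values rather than `d_in * d_out`; the saving grows quickly for the large projection matrices found in diffusion models, which is why the approach keeps fine-tuning costs low.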

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: Domain-adapted environment generation framework (i.e., PanoGen++) tailored to MP3D for VLN agent training. In this setup, the agent receives instructions and navigates from a specified trajectory. The environment includes both original MP3D scenes and environments generated by PanoGen++, strengthening the agent's generalization to unseen settings.
  • Figure 2: Illustration of domain adaptation for environment generation in VLN. (a) Motivation for adapting PanoGen++ to VLN environments. (b) Training pipeline of PanoGen++: an image from the original VLN environments is first encoded into the latent space, following LDM (Rombach et al., 2022). The environments are then augmented by the generation module, exemplified here by the inpainting process, which requires two additional inputs: a mask and a masked image. During training, the VAE and U-Net weights are frozen, and trainable adaptation modules are added.
  • Figure 3: Overview of VLN agent training with PanoGen++ for environment augmentation. (a) Room panorama captioning for domain-specific adaptation. (b) Masked image inpainting for semantic alignment. (c) Recursive image outpainting for coherent panoramic extension. The recursive image outpainting settings follow PanoGen (Li and Bansal, 2023) for fair comparison; masked image inpainting is the newly added technique for environment augmentation.
  • Figure 4: Qualitative analysis of the panoramic environments generated by PanoGen++. Here, Matterport3D serves as the original environment for VLN tasks, while PanoGen denotes the panoramic environments generated by the PanoGen baseline.
  • Figure 5: Qualitative analysis of the panoramic environments generated on the Cityscapes dataset. Here, the original Cityscapes images serve as the reference, while PanoGen++ and PanoGen denote the panoramic environments generated by our method and the baseline, respectively.
  • ...and 2 more figures
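
The Figure 2 caption notes that the inpainting process takes two additional inputs, a mask and a masked image. The sketch below shows one simple way such a pair could be built from a single image. It is an illustration only: the function name `make_inpainting_inputs`, the rectangular `box` region, and the convention that `1` marks pixels to regenerate are all assumptions, not details taken from the paper.

```python
def make_inpainting_inputs(image, box):
    """Build a (mask, masked_image) pair for an inpainting model.

    image: grid of pixel values as a list of rows (hypothetical toy format).
    box:   (top, left, bottom, right), with bottom and right exclusive.
    The mask holds 1 where content should be regenerated and 0 elsewhere;
    the masked image is a copy of the input with the masked pixels set to 0.
    """
    top, left, bottom, right = box
    height, width = len(image), len(image[0])
    mask = [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]
    masked_image = [
        [0 if mask[r][c] else image[r][c] for c in range(width)]
        for r in range(height)
    ]
    return mask, masked_image


# Usage: hide a 2 x 2 region of a 3 x 4 toy image.
toy_image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
toy_mask, toy_masked = make_inpainting_inputs(toy_image, (0, 1, 2, 3))
```

The unmasked pixels are passed through unchanged, which is what lets a generator keep the surrounding room layout while replacing only the hidden region according to the text description.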