AIGeN: An Adversarial Approach for Instruction Generation in VLN
Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, Rita Cucchiara
TL;DR
The paper tackles the data bottleneck in Vision-and-Language Navigation (VLN) by generating high-quality synthetic instructions to augment training data. AIGeN is a GAN-like model that uses a GPT-2 decoder to generate instructions from image sequences and a BERT-based discriminator to distinguish real from fake instructions, trained with unlabeled navigation paths and Mask2Former detections. The authors train on HM3D trajectories, evaluate the generated instructions with image-description metrics and downstream VLN performance, and report improved navigation results and state-of-the-art performance on REVERIE and R2R. The synthetic instructions improve an off-the-shelf VLN model, which supports broader use of synthetic annotation for embodied AI tasks.
Abstract
In the last few years, research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task in which an agent follows human instructions and navigates a previously unknown environment to reach a specified goal. Recent work in the literature focuses on different ways to augment the available datasets of instructions, improving navigation performance by exploiting synthetic training data. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During the training phase, the decoder generates sentences for a sequence of images describing the agent's path to a particular point, while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we generate synthetic instructions for 217K trajectories using AIGeN on the Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. The validation analysis is conducted on REVERIE and R2R and highlights the promising aspects of our proposal, which achieves state-of-the-art performance.
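The adversarial setup described above, in which a generator proposes instructions and a discriminator scores them as real or fake, can be illustrated with a minimal sketch. It assumes the standard binary cross-entropy GAN objective with a non-saturating generator loss; the abstract does not state the exact losses used by AIGeN, so this is an illustration and not the authors' implementation. The function names and the use of plain float logits (standing in for the BERT discriminator's outputs) are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bce(prob: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for a single probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def discriminator_loss(real_logits, fake_logits):
    """Assumed objective: score real instructions as 1 and generated ones as 0."""
    losses = [bce(sigmoid(z), 1.0) for z in real_logits]
    losses += [bce(sigmoid(z), 0.0) for z in fake_logits]
    return sum(losses) / len(losses)

def generator_loss(fake_logits):
    """Assumed non-saturating objective: the generator is rewarded when the
    discriminator scores its instructions as real."""
    return sum(bce(sigmoid(z), 1.0) for z in fake_logits) / len(fake_logits)

# Hypothetical discriminator logits for two real and two generated instructions.
real = [2.0, 1.5]
fake = [-1.0, 0.5]
print("discriminator loss:", discriminator_loss(real, fake))
print("generator loss:", generator_loss(fake))
```

In a full system, the logits would come from the encoder applied to instruction and image features, and the two losses would be minimized in alternating steps.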
