MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

Junyou Zhu, Yanyuan Qiao, Siqi Zhang, Xingjian He, Qi Wu, Jing Liu

TL;DR

MiniVLN addresses the tension between Vision-and-Language Navigation (VLN) performance and deployability by training a compact student model through progressive two-stage knowledge distillation from a large VLN teacher. The approach distills fine-grained feature representations during pre-training and navigation-specific logits during fine-tuning, which lets the student closely match the teacher with only about 12% of its parameters. On R2R and REVERIE, MiniVLN achieves results on par with the teacher while running more than three times faster on CPU, which suggests practical viability for edge devices. The work offers a blueprint for building compact, high-performing embodied AI models with staged distillation in multimodal navigation.

Abstract

In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model's parameter count.

Paper Structure (24 sections, 9 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Model parameters versus accuracy on the R2R dataset among state-of-the-art (SoTA) VLN methods. Compared to other student models, MiniVLN achieves the best performance. Compared to SoTA methods, MiniVLN uses only about 12% of the parameters.
  • Figure 2: Overview of the two-stage knowledge distillation process for VLN. In the pre-training phase, fine-grained knowledge is distilled, while navigation-specific knowledge is learned during fine-tuning. This approach narrows the performance gap between the teacher and student models more effectively than single-stage distillation.
  • Figure 3: Overall framework of MiniVLN. The yellow box represents the teacher model, while the blue box denotes the student model. The orange arrows represent the distillation process during the pre-training phase, while the blue arrows denote the distillation during the fine-tuning phase. During the pre-training phase, we perform fine-grained distillation by designing $\mathcal{L}_{t}$, which distills knowledge between Transformer layers for feature and representation learning. In the fine-tuning phase, we distill only the logits directly related to navigation.
  • Figure 4: Ablation of two-stage distillation on the R2R dataset. MiniVLN maintains performance comparable to the teacher model while achieving approximately 4% higher performance than the non-distilled model.
  • Figure 5: Inference time comparison between ScaleVLN and MiniVLN on CPU. On both datasets, MiniVLN runs inference more than three times faster than ScaleVLN.
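
The two distillation stages described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mean-squared-error form of the layer loss, the KL-divergence form of the logit loss, and the temperature value are all assumptions chosen for illustration. The sketch also assumes that each student layer has already been paired with a teacher layer and that their hidden vectors have the same size (for example, after a projection that is not shown here).

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def feature_distill_loss(student_layers, teacher_layers):
    """Pre-training stage (assumed form): mean squared error between
    the hidden vectors of paired student and teacher Transformer layers."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_layers, teacher_layers):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def logit_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Fine-tuning stage (assumed form): KL(teacher || student) over
    temperature-softened navigation action distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor is the usual rescaling for softened targets.
    return kl * temperature ** 2


# Hypothetical toy values: two paired layers with 3-dim hidden vectors,
# and logits over three candidate navigation actions.
teacher_hidden = [[0.2, -0.1, 0.5], [0.9, 0.3, -0.4]]
student_hidden = [[0.1, -0.1, 0.4], [0.7, 0.2, -0.4]]
print("feature loss:", feature_distill_loss(student_hidden, teacher_hidden))
print("logit loss:", logit_distill_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0]))
```

In this sketch the feature loss would be applied only while pre-training the student and the logit loss only while fine-tuning it on navigation, mirroring the orange and blue arrows in Figure 3. Both losses are zero when the student reproduces the teacher exactly.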