The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation

Haryo Akbarianto Wibowo, Thamar Solorio, Alham Fikri Aji

TL;DR

This paper investigates the value of knowledge distillation (KD) in multilingual settings by analyzing how well the student model acquires multilingual knowledge from the teacher model. It demonstrates that efficient weight initialization, done by copying the teacher's weights into the student, preserves multilingual capabilities even in low-resource scenarios.

Abstract

Knowledge distillation (KD) has proven to be a successful strategy for improving the performance of smaller models in many NLP tasks. However, most work on KD explores only monolingual scenarios. In this paper, we investigate the value of KD in multilingual settings. We assess the significance of KD and of model initialization by analyzing how well the student model acquires multilingual knowledge from the teacher model. Our proposed method emphasizes copying the teacher model's weights directly into the student model to enhance initialization. Our findings show that, across various multilingual settings, initializing the student with weights copied from the fine-tuned teacher contributes more than the distillation process itself. Furthermore, we demonstrate that efficient weight initialization preserves multilingual capabilities even in low-resource scenarios.
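
The copy-weight initialization described in the abstract can be illustrated with a minimal sketch. It assumes the model weights are stored in a plain dictionary keyed by parameter name, with encoder layers named `layer.<index>.<param>`. The function names, the key format, and the evenly spaced layer selection are all hypothetical choices made for illustration; the abstract does not say which teacher layers are copied.

```python
import copy


def layer_map(num_teacher_layers, num_student_layers):
    """Map each student layer index to a teacher layer index.

    Assumption: layers are picked at an even stride, taking the last
    layer of each block. The source does not specify the selection rule.
    """
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    stride = num_teacher_layers // num_student_layers
    return {s: (s + 1) * stride - 1 for s in range(num_student_layers)}


def init_student_from_teacher(teacher_state, num_teacher_layers, num_student_layers):
    """Build a student state dict by copying weights from the teacher."""
    mapping = layer_map(num_teacher_layers, num_student_layers)
    student_state = {}

    # Parameters outside the layer stack (e.g. embeddings) are copied as-is.
    for name, value in teacher_state.items():
        if not name.startswith("layer."):
            student_state[name] = copy.deepcopy(value)

    # Each student layer receives the weights of its mapped teacher layer.
    for s_idx, t_idx in mapping.items():
        prefix = f"layer.{t_idx}."
        for name, value in teacher_state.items():
            if name.startswith(prefix):
                suffix = name[len(prefix):]
                student_state[f"layer.{s_idx}.{suffix}"] = copy.deepcopy(value)

    return student_state


# Toy 4-layer teacher, 2-layer student (the sizes used in Figure 1).
teacher = {
    "emb.weight": [0.5],
    "layer.0.w": [0.0],
    "layer.1.w": [1.0],
    "layer.2.w": [2.0],
    "layer.3.w": [3.0],
}
student = init_student_from_teacher(teacher, 4, 2)
```

Copying only works directly when the student and teacher share the same hidden size, which this sketch takes for granted.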

Paper Structure (23 sections, 9 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: Overall architecture of the knowledge distillation used in this paper. The teacher distills its knowledge using a Mean Squared Error (MSE) loss, followed by a Cross Entropy (CE) loss with respect to the ground truth from the labeled dataset. The example shows a student-teacher pair with 2 and 4 layers, respectively. Att denotes attention output, Hidden denotes hidden output, and Emb denotes embedding weights.
  • Figure 2: Performance across different data subsets in different initialization strategies.
  • Figure 3: Training loss plot per step across different data subsets.
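
The two loss terms named in the Figure 1 caption can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: features are flat lists of floats, student and teacher features have matching sizes, and the helper names (`mse`, `cross_entropy`, `distill_loss`) are hypothetical. How the paper weights or schedules the two terms is not stated in this summary, so the sketch keeps them separate.

```python
import math


def mse(student_vec, teacher_vec):
    """Mean squared error between two equally sized feature vectors."""
    if len(student_vec) != len(teacher_vec):
        raise ValueError("feature sizes must match")
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)


def cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def distill_loss(student_feats, teacher_feats):
    """Sum of MSE terms over paired features (e.g. Att and Hidden outputs)."""
    return sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))


# Toy values: one attention feature and one hidden feature per model.
student_feats = [[1.0, 2.0], [0.0, 0.0]]
teacher_feats = [[1.0, 4.0], [0.0, 2.0]]
kd_term = distill_loss(student_feats, teacher_feats)
ce_term = cross_entropy([0.0, 0.0], label=0)
```

In a 2-layer student and 4-layer teacher setup such as the one in Figure 1, each student feature would be paired with the feature of whichever teacher layer it is mapped to; that pairing is an input to this sketch rather than something it decides.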