The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation

Haryo Akbarianto Wibowo; Thamar Solorio; Alham Fikri Aji

TL;DR

This paper investigates the value of knowledge distillation (KD) in multilingual settings by analyzing how well the student model acquires multilingual knowledge from the teacher model. It proposes copying the teacher's weights into the student to improve initialization, and shows that efficient weight initialization preserves multilingual capabilities even in low-resource scenarios.

Abstract

Knowledge distillation (KD) has proven to be a successful strategy for improving the performance of smaller models in many NLP tasks. However, most work on KD explores only monolingual scenarios. In this paper, we investigate the value of KD in multilingual settings. We assess the significance of KD and model initialization by analyzing how well the student model acquires multilingual knowledge from the teacher model. Our proposed method copies the teacher model's weights directly into the student model to improve initialization. Our findings show that, across various multilingual settings, initializing the student with weights copied from the fine-tuned teacher contributes more than the distillation process itself. Furthermore, we demonstrate that efficient weight initialization preserves multilingual capabilities even in low-resource scenarios.
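The weight-copy initialization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter naming scheme (`encoder.layer.<i>...`), the helper `copy_teacher_weights`, and the choice of which teacher layers feed which student layers (`layer_map`) are all assumptions made for illustration, since this page does not specify them. Weights are plain Python lists here to keep the example free of dependencies.

```python
import re

# Assumed naming scheme: transformer layer parameters contain "layer.<index>.".
LAYER_PATTERN = re.compile(r"^(.*layer\.)(\d+)(\..*)$")


def copy_teacher_weights(teacher_state, layer_map):
    """Build a student state dict by copying selected teacher parameters.

    teacher_state: dict mapping parameter names to weights.
    layer_map: dict {student_layer_index: teacher_layer_index}; the mapping
        itself is a hypothetical choice, not taken from the paper.
    """
    student_state = {}
    for name, weight in teacher_state.items():
        match = LAYER_PATTERN.match(name)
        if match is None:
            # Parameters outside the layer stack (e.g. embeddings) are copied as-is.
            student_state[name] = weight
            continue
        prefix, teacher_idx, suffix = match.group(1), int(match.group(2)), match.group(3)
        # Copy this teacher layer into every student layer that maps to it.
        for student_idx, mapped_teacher_idx in layer_map.items():
            if mapped_teacher_idx == teacher_idx:
                student_state[f"{prefix}{student_idx}{suffix}"] = weight
    return student_state


# Toy 4-layer teacher and 2-layer student, mirroring the sizes in Figure 1.
teacher_state = {
    "embeddings.weight": [0.1],
    "encoder.layer.0.weight": [0.0],
    "encoder.layer.1.weight": [1.0],
    "encoder.layer.2.weight": [2.0],
    "encoder.layer.3.weight": [3.0],
}
# Assumption: the student takes every other teacher layer.
student_state = copy_teacher_weights(teacher_state, {0: 1, 1: 3})
```

In a real framework the resulting dictionary would be loaded into the student model before distillation or fine-tuning begins; the layer-selection rule is the main design choice and would need to follow the paper's own Model Initialization section.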

Paper Structure (23 sections, 9 equations, 3 figures, 12 tables)

Introduction
Methodology
Distillation Architecture
Model Initialization
Experiment Setup
Data
Model
Training
Multilingual Transferability in KD
Weight Copy Transfers More Information vs Distillation Loss
Knowledge of Unseen Languages is Transferrable with Seen Language Teacher Weight Copy
Multilingual Distillation is Possible Even if Only English Data is Available
Behavior Analysis in Copy-weight Strategy
Weight Copy Models Preserve Some Information Even Without Fine-tuning
Weight Copy Models Achieve Higher Performance with Less Data
...and 8 more sections

Figures (3)

Figure 1: Overall architecture of the knowledge distillation used in this paper. The teacher distills its knowledge to the student through a mean squared error (MSE) loss, followed by a cross-entropy (CE) loss with respect to the ground truth from the labeled dataset. The example shows a student-teacher pair with 2 and 4 layers, respectively. Att denotes attention output, Hidden denotes hidden output, and Emb denotes embedding weights.
Figure 2: Performance across different data subsets in different initialization strategies.
Figure 3: Training loss plot per step across different data subsets.
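The two loss terms named in the Figure 1 caption can be written out as a small sketch. The function names and toy values below are assumptions for illustration, and the sketch deliberately leaves the terms separate because this page does not state how they are weighted or scheduled. It also assumes student and teacher outputs have the same size, so no projection is needed.

```python
import math


def mse(student_values, teacher_values):
    """Mean squared error between student and teacher outputs of equal length."""
    if len(student_values) != len(teacher_values):
        raise ValueError("student and teacher outputs must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_values, teacher_values)]
    return sum(squared) / len(squared)


def cross_entropy(logits, label):
    """Cross-entropy of one example: negative log-softmax at the gold label."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


# Toy values: a student hidden output compared with the matching teacher output,
# and student logits compared with a ground-truth label.
hidden_loss = mse([1.0, 2.0], [1.0, 4.0])
label_loss = cross_entropy([0.0, 0.0], label=0)
```

The same `mse` function would apply to each of the quantities the caption lists (attention output, hidden output, embedding weights), once for every student-teacher layer pair being matched.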
