KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh

TL;DR

This work addresses the challenge of deploying oversized pre-trained language models on resource-constrained devices by introducing KroneckerBERT, which compresses embedding and Transformer weights using Kronecker factorization and restores performance with intermediate-layer knowledge distillation (KD) from a full BERT_BASE teacher. The approach yields up to $19\times$ compression (≈5% of the original size) and achieves state-of-the-art results on GLUE under high compression while also excelling on SQuAD, with favorable out-of-distribution robustness. A two-stage KD process, including pre-training distillation, is shown to be essential to bridge the gap between the compressed student and the teacher. The results support Kronecker-based compression as a practical pathway to efficient, robust Transformer models suitable for edge deployment and diverse NLP tasks.

Abstract

The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power, it makes them unsuitable for deployment on low-capacity devices. We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition. We use this decomposition for compression of the embedding layer, all linear mappings in the multi-head attention, and the feed-forward network modules in the Transformer layer. We perform intermediate-layer knowledge distillation using the uncompressed model as the teacher to improve the performance of the compressed model. We present our KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework. We evaluate the performance of KroneckerBERT on well-known NLP benchmarks and show that for a high compression factor of 19 (5% of the size of the BERT_BASE model), our KroneckerBERT outperforms state-of-the-art compression methods on the GLUE benchmark. Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to the state-of-the-art compression methods on SQuAD.
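
To make the compression idea in the abstract concrete, the following is a minimal sketch of a Kronecker product and of the parameter saving it gives when a weight matrix is stored as two small factors instead of one large matrix. The helper names `kron` and `kron_param_counts` and the factor shapes are illustrative assumptions, not the configuration used in the paper.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    out = [[0] * (cols_a * cols_b) for _ in range(rows_a * rows_b)]
    for i in range(rows_a):
        for j in range(cols_a):
            for k in range(rows_b):
                for l in range(cols_b):
                    # Block (i, j) of the result is a[i][j] * b.
                    out[i * rows_b + k][j * cols_b + l] = a[i][j] * b[k][l]
    return out


def kron_param_counts(m1, n1, m2, n2):
    """Parameters of the full (m1*m2) x (n1*n2) matrix vs. its two factors."""
    full = (m1 * m2) * (n1 * n2)
    factored = m1 * n1 + m2 * n2
    return full, factored


# Two 2 x 2 factors produce a 4 x 4 matrix.
A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
W = kron(A, B)

# Hypothetical shapes: a 768 x 768 weight stored as (2 x 2) and (384 x 384) factors.
full, factored = kron_param_counts(2, 2, 384, 384)
print(f"full={full}, factored={factored}, ratio={full / factored:.2f}")
```

Different choices of factor shapes for the same full matrix give different compression ratios, which is the knob a Kronecker-based scheme can tune per layer.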

Paper Structure

This paper contains 23 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An example of Kronecker product of two 2 by 2 matrices
  • Figure 2: Illustration of our proposed method for the compression of the embedding layer. Left: conventional embedding stored in a lookup table. Right: Our proposed compression method where the original embedding matrix is represented as a Kronecker product of a matrix and a row vector. The matrix is stored in a lookup table to minimize computation overhead.
  • Figure 3: Illustration of the proposed framework. Left: A diagram of the teacher BERT model and the student KroneckerBERT. Right: The two-stage KD methodology used to train KroneckerBERT.
  • Figure 4: t-SNE visualization of the output of the middle Transformer layer of the fine-tuned models on SST-2 dev. Left: Fine-tuned BERT$_\text{BASE}$, middle: KroneckerBERT$_8$ fine-tuned without KD, right: KroneckerBERT$_8$ when trained using KD in two stages. The colours indicate the positive and negative classes.
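
The embedding scheme described in the Figure 2 caption can be sketched as follows. If the embedding matrix is the Kronecker product of a matrix and a row vector, then the embedding of one token is the Kronecker product of a single looked-up row with that vector, so the full matrix never needs to be materialized. The function names and the toy sizes below are assumptions for illustration only.

```python
def kron_embedding_lookup(table, row_vec, token_id):
    """Embedding of one token when E = table (Kronecker product) row_vec.

    Row `token_id` of E equals the Kronecker product of row `token_id`
    of `table` with `row_vec`, so only one small row is read.
    """
    a_row = table[token_id]
    return [a * b for a in a_row for b in row_vec]


def full_embedding(table, row_vec):
    """Materialize the full embedding matrix (for checking only)."""
    return [[a * b for a in a_row for b in row_vec] for a_row in table]


# Toy example: vocabulary of 3 tokens, stored width 2, row vector of length 3,
# giving an effective embedding dimension of 2 * 3 = 6.
table = [[1, 2], [3, 4], [5, 6]]
row_vec = [10, 20, 30]

vec = kron_embedding_lookup(table, row_vec, token_id=1)
assert vec == full_embedding(table, row_vec)[1]
print(vec)
```

In this toy setup the stored parameters are 3 * 2 + 3 = 9 values instead of 3 * 6 = 18, and the lookup cost is one table read plus a small outer product.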