Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

TL;DR

The paper tackles the challenge of deploying extremely compressed transformer models by proposing XTC, a simple two-step pipeline that combines lightweight layer reduction with 1-bit quantization and knowledge distillation (KD). Through a systematic study, it shows that longer training with data augmentation and single-stage KD suffices to match or exceed prior extreme-quantization results, reducing the need for expensive multi-stage distillation. Combining these strategies yields up to 50x compression while achieving state-of-the-art GLUE results, including a 5-layer BERT that outperforms the 6-layer TinyBERT. The work offers practical guidance for ultra-low-bit quantization with minimal hyperparameter tuning and reduced computational cost, easing edge deployment of transformer models.

Abstract

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., expensive multi-stage knowledge distillation with extensive hyperparameter tuning. They also tend to pay less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and they lack a systematic study showing the effectiveness of their methods. In this paper, we perform a comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
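
To make "binary/ternary quantization" concrete, here is a minimal sketch of two widely used weight quantizers: sign-based binarization with a mean-absolute-value scale, and threshold-based ternarization. These are generic schemes chosen for illustration; the abstract does not specify XTC's exact quantizer, and the function names and the 0.7 threshold ratio are assumptions, not taken from the paper.

```python
from typing import Tuple

import numpy as np


def binarize(w: np.ndarray) -> Tuple[np.ndarray, float]:
    """Approximate w by alpha * signs, with signs in {-1, +1} (1 bit per weight)."""
    alpha = float(np.mean(np.abs(w)))       # one full-precision scale per tensor
    signs = np.where(w >= 0, 1.0, -1.0)     # zero is mapped to +1 by convention here
    return signs, alpha


def ternarize(w: np.ndarray, threshold_ratio: float = 0.7) -> Tuple[np.ndarray, float]:
    """Approximate w by alpha * codes, with codes in {-1, 0, +1}.

    threshold_ratio is an illustrative default (assumption), not a value from the paper.
    """
    delta = threshold_ratio * float(np.mean(np.abs(w)))  # magnitudes below delta become 0
    mask = np.abs(w) > delta
    alpha = float(np.mean(np.abs(w[mask]))) if mask.any() else 0.0
    codes = np.sign(w) * mask
    return codes, alpha


if __name__ == "__main__":
    w = np.array([0.5, -1.5, 1.0, -1.0])
    print(binarize(w))
    print(ternarize(w))
```

In both cases only the small integer codes and a single scale need to be stored per tensor, which is where the storage savings over 32-bit weights come from.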

Paper Structure

This paper contains 18 sections, 2 equations, 4 figures, 24 tables.

Figures (4)

  • Figure 1: The left figure summarizes how to do 1-bit quantization for a layer-reduced model based on TinyBERT (Jiao et al., 2020) and BinaryBERT (Bai et al., 2021). It involves expensive pretraining of a small fp32 model, task-specific training of 32-bit and 2-bit models, weight splitting, and the final 1-bit model training. Along the way, it applies multi-stage knowledge distillation with data augmentation, which requires considerable hyperparameter tuning effort. The right figure is our proposed method, XTC (see the design section for details), a simple yet effective pipeline (see Figure 2 for highlighted results). Best viewed on a computer screen.
  • Figure 2: Comparison of XTC with other SOTA results.
  • Figure 3: Performance of quantized BERT$_{\text{base}}$ with different weight bits and 8-bit activation on the GLUE benchmarks. The orange and blue curves correspond to the training costs Budget-A (limited) and Budget-C (sufficient), respectively. The fp32 teacher scores are shown by black square markers.
  • Figure 4: Three types of knowledge distillation. 1S-KD (top red arrow line) uses all of the outputs (hidden states, attentions, and logits) from the beginning of training to the end. 2S-KD (middle red and blue arrow line) separates the hidden states and attentions from the logits. 3S-KD (bottom red, blue, and green arrow line) follows 2S-KD but also adds a transition phase in the middle of training.
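
The three distillation variations in the Figure 4 caption can be sketched as a schedule of loss weights over training. This is a rough illustration only: the function name, the stage boundaries (halves for 2S-KD, thirds for 3S-KD), and the choice to keep both loss terms active during the 3S-KD transition phase are all assumptions, since the caption does not give these details.

```python
from typing import Tuple


def kd_loss_weights(step: int, total_steps: int, mode: str) -> Tuple[float, float]:
    """Return (w_intermediate, w_logits) for a training step.

    w_intermediate weights the hidden-state and attention losses;
    w_logits weights the logit (prediction) loss.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")

    if mode == "1S":
        # One stage: every loss term is active from start to end.
        return 1.0, 1.0

    if mode == "2S":
        # Two stages: intermediate losses first, then logits only.
        # The 50/50 split is an assumption.
        return (1.0, 0.0) if 2 * step < total_steps else (0.0, 1.0)

    if mode == "3S":
        # Three stages: as 2S, plus a middle transition phase.
        # Equal thirds and "both terms active" are assumptions.
        if 3 * step < total_steps:
            return 1.0, 0.0
        if 3 * step < 2 * total_steps:
            return 1.0, 1.0
        return 0.0, 1.0

    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("1S", "2S", "3S"):
        print(mode, [kd_loss_weights(s, 6, mode) for s in range(6)])
```

A training loop would then compute `w_i * intermediate_loss + w_l * logit_loss` at each step. Under this framing, 1S-KD is the only variant with no stage boundaries to tune.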