Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Xiaoxia Wu; Zhewei Yao; Minjia Zhang; Conglong Li; Yuxiong He

TL;DR

The paper tackles the challenge of deploying ultra-compressed transformer models by proposing XTC, a simple two-step pipeline that combines lightweight layer reduction with 1-bit quantization and distillation. Through a systematic study, it shows that long training with data augmentation and single-stage KD suffices to match or exceed prior extreme-quantization results, reducing the need for expensive multi-stage distillation. Combining these strategies yields substantial compression (up to 50x) together with state-of-the-art GLUE results, including a 5-layer BERT-base that outperforms the 6-layer TinyBERT. The work offers practical guidance for ultra-low-bit quantization with minimal hyperparameter tuning and reduced computational cost, which eases edge deployment of transformer models.

Abstract

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., expensive multi-stage knowledge distillation with extensive hyperparameter tuning. They also often focus less on smaller transformer models that have already been heavily compressed via knowledge distillation, and they lack a systematic study showing the effectiveness of their methods. In this paper, we perform a comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
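To make "binary quantization" concrete, here is a minimal sketch of one common 1-bit weight quantizer: each weight is replaced by a shared scale times its sign. The abstract does not specify the quantizer used in XTC, so the scale choice (mean absolute value) and the function name `binarize` are assumptions for illustration only.

```python
def binarize(weights):
    """Illustrative 1-bit quantizer (assumed form, not taken from the paper).

    Maps each weight w to alpha * sign(w), where alpha is the mean absolute
    value of the weights. Zero is mapped to +alpha by convention here.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    # One shared full-precision scale per weight group.
    alpha = sum(abs(w) for w in weights) / len(weights)
    # Only the sign of each weight needs storing: one bit per weight.
    return [alpha if w >= 0 else -alpha for w in weights]


if __name__ == "__main__":
    print(binarize([0.5, -1.5, 1.0, -1.0]))
```

Under a scheme like this, going from 32-bit to 1-bit weights shrinks the quantized weights by at most 32x, which is why the abstract pairs quantization with layer reduction to reach the reported 50x.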

Paper Structure (18 sections, 2 equations, 4 figures, 24 tables)

Introduction
Contribution.
Related Work
Extreme Compression Procedure Analysis
Is staged ternary-binary training necessary to mitigate the sharp performance drop?
The role of multi-stage knowledge distillation
The importance of data augmentation
Interplay of KD, Long Training, Data Augmentation and Layer Reduction
Proposed Method for Further Pushing the Limit of Extreme Compression
Evaluation Results.
Conclusions
Additional Related Work
Additional Details on Methodology, Experimental Setup and Results
Knowledge Distillation
Experimental Setup
...and 3 more sections

Figures (4)

Figure 1: The left figure summarizes how to do 1-bit quantization for a layer-reduced model based on TinyBERT (Jiao et al., 2020) and BinaryBERT (Bai et al., 2021). It involves expensive pre-training of an fp32 small model, task-specific training of 32-bit and 2-bit models, weight splitting, and the final 1-bit model training. Along the way, it applies multi-stage knowledge distillation with data augmentation, which requires considerable hyperparameter tuning effort. The right figure is our proposed method, XTC (see the proposed-method section for details), a simple yet effective pipeline (see Figure 2 for highlighted results). Best viewed on a computer screen.
Figure 2: Comparison between XTC and other SOTA results.
Figure 3: Performance of quantized BERT$_{\text{base}}$ with different weight bits and 8-bit activation on the GLUE benchmark. The orange and blue curves show results under the (limited) Budget-A and (sufficient) Budget-C training costs, respectively. The fp32 teacher scores are shown by black square markers.
Figure 4: Three types of knowledge distillation. 1S-KD (top red arrow line) uses all the outputs (hidden states, attentions, and logits) from the beginning of training to the end. 2S-KD (middle red and blue arrow line) separates the hidden states and attentions from the logits part. 3S-KD (bottom red, blue, and green arrow line) follows 2S-KD but also adds a transition phase in the middle of training.
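
The three schedules in Figure 4 can be sketched as a function that returns which distillation targets are active at a given training step. The caption gives no stage boundaries, so this sketch assumes that 2S-KD switches at the midpoint, that the 3S-KD transition phase is centered on the midpoint, and that all three targets are active during the transition. The name `active_kd_targets` and the parameter `transition_frac` are hypothetical.

```python
INTERMEDIATE = frozenset({"hidden_states", "attentions"})
LOGITS = frozenset({"logits"})
ALL_TARGETS = INTERMEDIATE | LOGITS


def active_kd_targets(step, total_steps, mode, transition_frac=0.2):
    """Return the distillation targets used at `step` (illustrative sketch).

    Stage boundaries are assumptions: the midpoint split and the transition
    width are not specified in the figure caption.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if mode == "1S-KD":
        # Single stage: every target is used from start to end.
        return ALL_TARGETS
    half = total_steps // 2
    if mode == "2S-KD":
        # Stage 1: hidden states and attentions. Stage 2: logits only.
        return INTERMEDIATE if step < half else LOGITS
    if mode == "3S-KD":
        # Like 2S-KD, plus an assumed transition window around the midpoint.
        width = int(total_steps * transition_frac)
        start = half - width // 2
        end = start + width
        if step < start:
            return INTERMEDIATE
        if step < end:
            return ALL_TARGETS
        return LOGITS
    raise ValueError(f"unknown mode: {mode!r}")
```

A training loop would call this once per step and sum only the loss terms whose targets are returned, which makes the extra scheduling choices required by 2S-KD and 3S-KD explicit compared with 1S-KD.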
