Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
TL;DR
The paper tackles the challenge of deploying ultra-compressed transformer models by proposing XTC, a simple two-step pipeline that combines lightweight layer reduction with 1-bit quantization and knowledge distillation (KD). Through a systematic study, it shows that longer training with data augmentation and single-stage KD is enough to match or exceed prior extreme-quantization results, reducing the need for expensive multi-stage distillation. Combining these strategies yields up to 50x compression while achieving state-of-the-art GLUE results, including a 5-layer BERT that outperforms the 6-layer TinyBERT. The work offers practical guidance for ultra-low-bit quantization with minimal hyperparameter tuning and reduced computational cost, which eases edge deployment of transformer models.
Abstract
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., expensive multi-stage knowledge distillation with extensive hyperparameter tuning. They also often pay less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and they lack a systematic study to show the effectiveness of their methods. In this paper, we perform a comprehensive, systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction can reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
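To make the size arithmetic behind binary quantization and layer reduction concrete, the following is a minimal sketch and not the XTC implementation. It assumes a common binarization rule (each weight becomes +alpha or -alpha, with alpha the mean absolute weight), which the abstract does not specify, and it counts weight bits only. The function names `binarize` and `size_ratio` and all parameter counts are hypothetical and chosen for illustration.

```python
def binarize(weights):
    """Map each weight to +alpha or -alpha, where alpha = mean(|w|).

    Assumption: this sign-with-scale rule is a common 1-bit scheme;
    the abstract does not state which quantizer XTC uses.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights], alpha


def size_ratio(params_before, params_after, bits_before=32, bits_after=1):
    """Storage ratio before vs. after compression, counting weight bits only."""
    return (params_before * bits_before) / (params_after * bits_after)


# Toy weights: alpha is the mean of the absolute values.
quantized, alpha = binarize([0.5, -1.5, 1.0, -1.0])

# Illustrative parameter counts (hypothetical, not taken from the paper).
ratio_quant_only = size_ratio(100, 100)        # 32-bit -> 1-bit, same parameter count
ratio_with_half_params = size_ratio(100, 50)   # also drop half of the parameters
```

Under these assumptions, 1-bit quantization alone gives a 32x reduction, and removing parameters multiplies that factor further. The 50x figure reported in the abstract comes from the paper's own accounting of the real model, which this sketch does not attempt to reproduce.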
