A Survey on Transformer Compression

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

TL;DR

This survey analyzes how to compress Transformer-based models to enable practical deployment of LLMs and LVMs on resource-constrained devices. It organizes methods into quantization (PTQ and QAT), knowledge distillation (logits-based, hint-based, and API-based KD), pruning (unstructured and structured), and efficient architecture design, with detailed NLP and CV considerations. The authors discuss the interrelationships among methods, training-efficiency challenges, and practical constraints, highlighting directions such as extreme low-bit quantization, hardware-aware pruning, and architecture alternatives like RWKV, RetNet, and Mamba. The work provides a comprehensive roadmap for advancing Transformer compression, emphasizing cross-domain insights and the importance of scalable, deployable solutions in real-world settings.

Abstract

The Transformer plays a vital role in natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLMs) and large vision models (LVMs). Model compression methods reduce the memory and computational cost of Transformers, a necessary step for deploying large language and vision models on practical devices. Because the Transformer architecture alternates attention and feedforward neural network (FFN) modules, it usually requires dedicated compression techniques. The efficiency of the compression methods themselves also matters, since retraining large models on the entire training dataset is usually impractical. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models. The methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design (Mamba, RetNet, RWKV, etc.). In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relations among the various compression methods and discuss future directions in this domain.
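
As a rough illustration of how quantization, one of the four categories above, reduces memory cost, the following is a minimal sketch of symmetric per-tensor INT8 quantization. It is not taken from the survey: the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and real post-training quantization pipelines add calibration data, per-channel scales, and activation handling.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to INT8 codes.

    Hypothetical helper for illustration only; returns (codes, scale).
    """
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Illustrative weights (assumed values, not from the paper).
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Each weight is off by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in 8 bits instead of 32, plus one shared scale, which is roughly a 4x reduction in weight memory at the cost of a rounding error of at most half a quantization step per weight.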

Paper Structure

This paper contains 26 sections, 11 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Transformer-based models have become the predominant architectures in both natural language processing (NLP) and computer vision (CV), resulting in a surge in publications. Because these models tend to be very large, compressing their parameters and removing computational redundancy is essential for efficient deployment on practical platforms and in real-world applications.
  • Figure 2: Overview of quantization for Transformers. The top shows the problems addressed by existing works in computer vision and natural language processing; the bottom shows a typical INT8 inference process for a standard Transformer block.
  • Figure 3: ViT-B_16
  • Figure 4: ViT-B_16-224
  • Figure 5: ViT-L_16
  • ...and 8 more figures