Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Junhan Kim, Chungman Lee, Eulrang Cho, Kyungphil Park, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

TL;DR

A novel PTQ algorithm called aespa is proposed that performs quantization layer-wise for efficiency while targeting attention-wise reconstruction to account for the cross-layer dependency within the attention module.

Abstract

With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile phones and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can become a bottleneck in real-world settings where models are updated frequently and hyperparameters must be tuned repeatedly. As a cost-effective alternative, learning-free PTQ schemes have been proposed. However, their performance is somewhat limited because they cannot account for the inter-layer dependency within the attention module, which is a significant feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while targeting attention-wise reconstruction to account for the cross-layer dependency. Through extensive experiments on various language models and a complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.
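
To make the distinction between the two reconstruction targets concrete, the following is a minimal sketch and not the aespa algorithm itself. It assumes a plain symmetric round-to-nearest quantizer, random data, and a single attention head. All names (`quantize_rtn`, `attention_scores`, `layer_err`, `attn_err`) are hypothetical. It quantizes only the query projection and measures the error twice: once at the output of that layer, and once at the attention scores, where the key projection also enters.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Symmetric per-tensor round-to-nearest quantizer (illustrative assumption)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    return np.clip(np.round(w / scale), -q_max - 1, q_max) * scale

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_scores(x, w_q, w_k):
    """softmax(Q K^T / sqrt(d)) for a single head."""
    d = w_q.shape[1]
    return softmax((x @ w_q) @ (x @ w_k).T / np.sqrt(d))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))     # 8 tokens, hidden size 16 (toy sizes)
w_q = rng.standard_normal((16, 16))  # query projection
w_k = rng.standard_normal((16, 16))  # key projection

# Quantize one layer on its own, keeping the other at full precision.
w_q_hat = quantize_rtn(w_q)

# Layer-wise target: error at the output of the quantized layer only.
layer_err = np.linalg.norm(x @ w_q - x @ w_q_hat) ** 2

# Attention-wise target: error after the quantized layer interacts with
# the key projection inside the attention module.
attn_err = np.linalg.norm(
    attention_scores(x, w_q, w_k) - attention_scores(x, w_q_hat, w_k)
) ** 2

print(f"layer-wise error:     {layer_err:.4f}")
print(f"attention-wise error: {attn_err:.4f}")
```

The two numbers measure different things: the first ignores how the query projection is used downstream, while the second depends on the keys and the softmax as well. An objective built on the second quantity is what the abstract refers to as attention-wise reconstruction.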

Paper Structure (33 sections, 43 equations, 2 figures, 19 tables, 1 algorithm)

Figures (2)

  • Figure 1: Overview of aespa. Each weight is quantized separately to reconstruct the attention output.
  • Figure 2: Quantization strategies (simplified)