OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models

Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung

TL;DR

This paper presents a hardware-software co-design method that yields OPAL, an energy-efficient LLM accelerator for generation tasks. The OPAL hardware architecture consists of FP units for handling outliers and vectorized INT multipliers for the dominant non-outlier operations.

Abstract

To overcome the burden on memory size and bandwidth caused by the ever-increasing size of large language models (LLMs), aggressive weight quantization has recently been studied, while research on quantizing activations has been lacking. In this paper, we present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks. First, we propose a novel activation quantization method that leverages the microscaling data format while preserving several outliers per sub-tensor block (e.g., four out of 128 elements). Second, on top of preserving outliers, we use mixed precision: inputs to sensitive layers in the decoder block of an LLM are set to 5 bits, while inputs to less sensitive layers are kept at 3 bits. Finally, we present the OPAL hardware architecture, which consists of FP units for handling outliers and vectorized INT multipliers for the dominant non-outlier operations. In addition, OPAL uses a log2-based approximation of the softmax operation that requires only shifts and subtractions, which maximizes power efficiency. As a result, we improve energy efficiency by 1.6~2.2x and reduce area by 2.4~3.1x with negligible accuracy loss, i.e., a perplexity increase of less than 1.
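The outlier-preserving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rule for choosing the shared power-of-two scale, and storing outliers as plain Python floats (standing in for a higher-precision format) are all assumptions made for illustration. The sketch keeps the n largest-magnitude elements of a block untouched and quantizes the remaining elements to signed integers that share one power-of-two scale.

```python
import math

def quantize_block(block, bits=4, n_outliers=4):
    """Quantize one block; returns (scale_exp, ints, outliers).

    Hypothetical helper: the scale-selection rule below is an assumption,
    not taken from the paper.
    """
    # Indices of the n largest-magnitude elements, kept at full precision.
    order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
    outlier_idx = set(order[:n_outliers])
    outliers = {i: block[i] for i in outlier_idx}

    # Largest representable magnitude for a signed `bits`-bit integer.
    qmax = 2 ** (bits - 1) - 1
    rest_max = max(
        (abs(x) for i, x in enumerate(block) if i not in outlier_idx),
        default=0.0,
    )
    # Shared power-of-two scale: smallest exponent that avoids clipping
    # the non-outlier elements.
    scale_exp = math.ceil(math.log2(rest_max / qmax)) if rest_max > 0 else 0
    scale = 2.0 ** scale_exp

    ints = [
        0 if i in outlier_idx
        else max(-qmax, min(qmax, round(x / scale)))
        for i, x in enumerate(block)
    ]
    return scale_exp, ints, outliers

def dequantize_block(scale_exp, ints, outliers):
    """Rebuild the block: outliers are restored exactly, the rest rescaled."""
    scale = 2.0 ** scale_exp
    return [outliers.get(i, q * scale) for i, q in enumerate(ints)]

def max_abs_error(block, **kwargs):
    """Largest absolute reconstruction error for one block."""
    restored = dequantize_block(*quantize_block(block, **kwargs))
    return max(abs(a - b) for a, b in zip(block, restored))

# Synthetic 128-element block: small values plus four large outliers.
block = [0.5 * math.sin(i) for i in range(128)]
for idx, val in [(5, 40.0), (17, -35.0), (60, 28.0), (99, -50.0)]:
    block[idx] = val

err_plain = max_abs_error(block, bits=4, n_outliers=0)
err_preserved = max_abs_error(block, bits=4, n_outliers=4)
```

On this synthetic block, the outliers force a coarse shared scale when nothing is preserved, so the small elements collapse to zero; preserving the four outliers lets the shared scale follow the small elements instead, and `err_preserved` comes out well below `err_plain`.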

Paper Structure (18 sections, 3 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Comparison of single-batch latency when running Llama2's first FFN layer (mlp.0) at various sizes and bit-widths using CUTLASS. 'hGEMM' uses FP16 computing units, while 'iGEMM' uses INT8 computing units.
  • Figure 2: Various data formats: (a) bfloat16, (b) original MXINT4 microscaling, and (c) the proposed outlier-preserved MXINT4 format (i.e., MX-OPAL4).
  • Figure 3: Comparison between different data formats on quantizing the original data of 128 elements extracted from the 2nd decoder block in Llama2-7B. (a) Original data, (b) 2-bit MinMax, (c) MXINT2, and (d) MX-OPAL2.
  • Figure 4: Comparison of the impact of preserving varying number of outliers (n) on the quantization noise with MX formats at the 20th decoder block in Llama2-7B. The block size k is set to 128, and (a) 'sign + mantissa bits' = 8 ($b=8$), (b) 'sign + mantissa bits' = 4 ($b=4$).
  • Figure 5: Overview of OPAL computation flow: (a) one decoder block, (b) a feed forward network (i.e., two FC layers), (c) an attention layer, (d) a multi-head attention layer, and (e) shift-based '$Attn\cdot V$' owing to log2-based softmax.
  • ...and 3 more figures
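
The shift-based 'Attn·V' product mentioned in Figure 5(e) can be sketched as follows. This is an illustrative approximation only, not the OPAL datapath: each softmax probability is rounded to the nearest power of two, so multiplying an integer value vector by that probability becomes a right shift. The function names and the `max_shift` cap are hypothetical, and the exponents are computed here in floating point for clarity, whereas the hardware described in the abstract uses only shifts and subtractions.

```python
import math

def log2_softmax_exponents(scores, max_shift=15):
    """Return integer shifts k_i such that softmax(scores)_i is roughly 2**(-k_i)."""
    m = max(scores)
    total = sum(math.exp(s - m) for s in scores)
    log2e = math.log2(math.e)
    # -log2(p_i) = log2(total) - (s_i - m) * log2(e): a subtraction in the
    # log2 domain, then rounding to an integer shift amount.
    return [
        min(max_shift, max(0, round(math.log2(total) - (s - m) * log2e)))
        for s in scores
    ]

def shift_attn_v(shifts, values):
    """Approximate sum_i p_i * V_i using right shifts on integer vectors."""
    dim = len(values[0])
    return [sum(v[d] >> k for k, v in zip(shifts, values)) for d in range(dim)]

# Four attention scores and four identical integer value vectors.
scores = [2.0, 1.0, 0.0, -1.0]
values = [[64, 128]] * 4

shifts = log2_softmax_exponents(scores)
approx = shift_attn_v(shifts, values)
```

Because the rounded probabilities no longer sum exactly to one, `approx` deviates from the exact weighted sum; the sketch shows only how multiplications turn into shifts and says nothing about how the paper controls the resulting error.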