HAAN: A Holistic Approach for Accelerating Normalization Operations in Large Language Models

Tianfan Peng, Jiajun Qin, Tianhua Xia, Sai Qian Zhang

TL;DR

This work tackles the normalization bottleneck in large language models by introducing HAAN, a holistic algorithm-hardware co-design for accelerating normalization operations such as LayerNorm and RMSNorm. The key ideas are to exploit cross-layer correlations in input statistics to predict or skip variance computations, use input subsampling to reduce workload, and apply quantization to ease hardware cost, all implemented in a reconfigurable accelerator with an input statistics calculator, a square root inverter, and a normalization unit. Empirical results show that HAAN achieves substantial hardware efficiency gains (power savings over 60% and latency reductions around 20%) while maintaining accuracy within about 1% of FP32 baselines across multiple LLMs and tasks. The approach is supported by detailed ablations, hardware evaluations on an FPGA platform, and comparisons to existing normalization accelerators, illustrating strong potential for improving end-to-end throughput in inference and training of large-scale transformers.
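
To make the input-subsampling idea concrete, the following is a minimal sketch rather than the authors' implementation. It estimates the mean and variance of a vector from a strided subset of its elements and then normalizes the full vector with those estimates. The function name `layernorm_subsampled`, the `stride` parameter, and the choice of strided sampling are assumptions made for illustration; the summary above only states that input subsampling is used.

```python
import math

def layernorm_subsampled(x, stride=2, eps=1e-5):
    """Normalize x with statistics estimated from every `stride`-th element.

    Hypothetical helper: strided sampling is an assumed subsampling scheme.
    Returns the normalized vector and the inverse standard deviation used.
    """
    sample = x[::stride]                      # subset used for the statistics
    mean = sum(sample) / len(sample)
    var = sum((v - mean) ** 2 for v in sample) / len(sample)
    inv_std = 1.0 / math.sqrt(var + eps)      # inverse standard deviation
    return [(v - mean) * inv_std for v in x], inv_std

# Compare the estimate against the exact statistics (stride=1).
x = [math.sin(i) for i in range(1024)]
_, inv_std_full = layernorm_subsampled(x, stride=1)
_, inv_std_half = layernorm_subsampled(x, stride=2)
rel_err = abs(inv_std_half - inv_std_full) / inv_std_full
```

With `stride=2` only half of the elements enter the reduction, which is the workload saving subsampling aims for; the price is an estimation error such as `rel_err`, whose size depends on the input distribution.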

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated architectural components aimed at improving training stability, convergence speed, and generalization capabilities. Among these components, normalization operations, such as layer normalization (LayerNorm), emerge as a pivotal technique, offering substantial benefits to overall model performance. However, previous studies have indicated that normalization operations can substantially increase processing latency and energy usage. In this work, we adopt the principles of algorithm and hardware co-design, introducing a holistic normalization acceleration method named HAAN. The evaluation results demonstrate that HAAN can achieve significantly better hardware performance compared to state-of-the-art solutions.

Paper Structure

This paper contains 20 sections, 8 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) Standard deviation skipping for efficient normalization. (b) Runtime breakdown for GPT-2 and OPT execution with and without applying optimization techniques, using a sequence length of 2048.
  • Figure 2: ISD values across different normalization layers within the LLaMA-7B model. Tokens are chosen randomly. The ISD values of the later layers are approximately linear in the logarithmic domain.
  • Figure 3: Data path of HAAN where $N$ denotes the size of embedding dimensions. In this figure, the floating-point and fixed-point data paths are highlighted in green and orange, respectively. The control signals are highlighted in black.
  • Figure 4: Hardware design for Input Statistics Calculator.
  • Figure 5: Hardware design of Square Root Inverter. In the figure $0x00C00000$ equals $1.5$ in fixed-point format.
  • ...and 5 more figures
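
Two observations in the captions above can be illustrated with a short sketch, under stated assumptions. First, the constant `0x00C00000` equals 1.5 only if the fixed-point format has 23 fractional bits, which is inferred here from the Figure 5 caption and is not stated in it. A 1.5 constant is consistent with a Newton-Raphson refinement of the inverse square root, $y \leftarrow y\,(1.5 - 0.5\,x\,y^2)$; whether the Square Root Inverter uses exactly this iteration is an assumption. Second, if the ISD values are linear in the logarithmic domain across later layers (Figure 2), a value at one layer can be extrapolated from two others. All function names below are hypothetical.

```python
import math

FRAC_BITS = 23               # assumption: inferred from 0x00C00000 == 1.5
ONE_POINT_FIVE = 0x00C00000  # constant quoted in the Figure 5 caption

def to_float(q, frac_bits=FRAC_BITS):
    """Interpret an integer as a fixed-point number with `frac_bits` fractional bits."""
    return q / (1 << frac_bits)

def inv_sqrt_newton(x, y0, iterations=3):
    """Refine a guess y0 of 1/sqrt(x) with Newton-Raphson steps (assumed method)."""
    y = y0
    for _ in range(iterations):
        y = y * (to_float(ONE_POINT_FIVE) - 0.5 * x * y * y)
    return y

def predict_log_linear(isd_a, layer_a, isd_b, layer_b, layer):
    """Extrapolate the ISD at `layer`, assuming log(ISD) is linear in the layer index."""
    slope = (math.log(isd_b) - math.log(isd_a)) / (layer_b - layer_a)
    return math.exp(math.log(isd_a) + slope * (layer - layer_a))

approx = inv_sqrt_newton(4.0, y0=0.4)                 # converges towards 0.5
predicted = predict_log_linear(1.0, 10, 0.5, 20, 30)  # made-up ISD values, for illustration only
```

The ISD values passed to `predict_log_linear` are invented for the example and are not measurements from the paper. The Newton-Raphson step converges only when the initial guess `y0` is reasonably close to the true value, so a hardware design would typically pair it with some cheap initial estimate.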