IterL2Norm: Fast Iterative L2-Normalization

ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong

TL;DR

This work introduces IterL2Norm, a division- and square-root-free iterative L2-normalization method for on-chip layer normalization in transformer-based LLMs. Grounded in a dynamic-system framework, it replaces the costly computation of $\sigma_y$ with a small fixed-point Euler update that converges to $\boldsymbol{y}/\|\boldsymbol{y}\|_2$ within five iterations, so normalization can run on the same chip as the MatMul engines. The authors provide a full macro design in 32/28nm CMOS, including initialization, update-rate settings, and data paths, and report a latency of 116–227 cycles for $64 \le d \le 1024$ (roughly 1.16–2.27 µs at the 100 MHz clock given in the abstract), with power and area presented as suitable for deployment in large-scale LLM accelerators. Precision tests in FP32, FP16, and BF16 show accuracy competitive with ground-truth layer normalization and favorable comparisons to the fast inverse square root method. An LLM-level evaluation on OPT models shows that perplexity degrades only slightly with a small number of iterations. The results position IterL2Norm as a practical, efficient alternative for on-chip layer normalization that can reduce data movement and energy in memory-bound transformer workloads.
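The idea of a division- and square-root-free iteration can be illustrated with a minimal Python sketch. It is not the authors' update equation, which this summary does not reproduce. As an assumption, it tracks a single scalar scale `a`, takes Euler steps on $da/dt = \alpha\, a\,(1 - a^2\|\boldsymbol{y}\|_2^2)$ (whose stable fixed point is $a = 1/\|\boldsymbol{y}\|_2$), and initializes `a` to a power of two derived from the exponent of $\|\boldsymbol{y}\|_2^2$. The names `iter_l2_normalize`, `steps`, and `lam` are hypothetical. With `lam = 0.5` the step coincides with the Newton-Raphson iteration for the inverse square root.

```python
import math


def iter_l2_normalize(y, steps=5, lam=0.5):
    """Approximate y / ||y||_2 using only multiplies and adds.

    Illustrative sketch, NOT the paper's update rule: `lam` (step size)
    and the power-of-two initialization are assumptions made here.
    """
    s = sum(v * v for v in y)  # squared L2 norm, ||y||^2
    if s == 0.0:
        return list(y)  # the zero vector has no direction; return it unchanged

    # s = m * 2**e with m in [0.5, 1). Halving the exponent gives a
    # power-of-two guess for 1/sqrt(s) such that a*a*s lies in [0.5, 2).
    _, e = math.frexp(s)
    a = math.ldexp(1.0, -(e // 2))

    # Euler step on da/dt = alpha * a * (1 - a^2 * s), with lam = alpha * dt.
    for _ in range(steps):
        a = a + lam * a * (1.0 - a * a * s)

    return [a * v for v in y]


if __name__ == "__main__":
    print(iter_l2_normalize([3.0, 4.0]))  # close to [0.6, 0.8]
```

Because the starting guess already places $a^2\|\boldsymbol{y}\|_2^2$ within a factor of two of 1, a handful of steps suffices in this sketch. Smaller values of `lam` converge more slowly and would need more steps.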

Abstract

Transformer-based large language models are memory-bound models whose operation relies on large amounts of data that are only marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each multi-head attention block and each feed-forward network block. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes $d$-dimensional vectors, where $64 \leq d \leq 1024$, with a latency of 116–227 cycles at 100 MHz/1.05 V.
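For context on the baseline named in the abstract, the following Python sketch shows layer normalization built on the classic fast inverse square root (the well-known bit-level initial guess with the constant `0x5F3759DF`, refined by one Newton-Raphson step). Whether the paper's baseline uses this exact constant or step count is an assumption. The function names `fast_inv_sqrt` and `layer_norm_fisr` are hypothetical, and the sketch omits the learnable scale and shift and the stabilizing epsilon of a full layer normalization.

```python
import struct


def fast_inv_sqrt(x, newton_steps=1):
    """Classic fast inverse square root for a positive FP32 value x.

    The magic constant and single Newton step follow the widely known
    variant; this is an assumption about the baseline, not the paper's code.
    """
    # Reinterpret the FP32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    i = 0x5F3759DF - (i >> 1)  # bit-level initial guess for 1/sqrt(x)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)  # Newton-Raphson refinement
    return y


def layer_norm_fisr(x):
    """Layer normalization without affine parameters or epsilon.

    Uses the identity LN(x) = y / sqrt(mean(y^2)) with y = x - mean(x),
    replacing 1/sqrt(.) by the approximation above.
    """
    inv_d = 1.0 / len(x)
    mean = sum(x) * inv_d
    y = [v - mean for v in x]
    mean_sq = sum(v * v for v in y) * inv_d  # variance of x
    scale = fast_inv_sqrt(mean_sq)
    return [scale * v for v in y]


if __name__ == "__main__":
    print(layer_norm_fisr([1.0, 2.0, 3.0, 4.0]))
```

Since $\sqrt{\operatorname{mean}(y^2)} = \|\boldsymbol{y}\|_2/\sqrt{d}$, the output equals $\sqrt{d}\,\boldsymbol{y}/\|\boldsymbol{y}\|_2$, which is why an accurate L2-normalization of the centered input is sufficient for layer normalization.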

Paper Structure

This paper contains 13 sections, 1 theorem, 13 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $\boldsymbol{y}$ and $\tilde{\boldsymbol{y}}$ be vectors of the same length. Let $k$ be a nonzero scalar value such that $k=\boldsymbol{y}\cdot\tilde{\boldsymbol{y}}$. Consider the following differential equation for $\tilde{\boldsymbol{y}}$ for a given $\boldsymbol{y}$. where $\alpha$ is a positive constant. For a given $\boldsymbol{y}$, $\tilde{\boldsymbol{y}}$ is initialized to $\tilde{\bo

Figures (6)

  • Figure 1: (a) Architecture of the IterL2Norm macro. (b) Data organization in the Input buffer. (c) Block diagram of the Add block, equipped with a total of nine 8-input adder trees.
  • Figure 2: Architecture of (a) the initialize and (b) the update modules in the iteration controller.
  • Figure 3: Approximation precision of IterL2Norm for various input lengths $d$ in (a) FP32, (b) FP16, and (c) BFloat16. The insets show the distribution of errors for $d=384$ over 1,000 input vectors.
  • Figure 4: Average absolute errors of IterL2Norm in FP32, FP16, and BFloat16 with the number of iteration steps.
  • Figure 5: Measured latency of IterL2Norm (five iteration steps) with input length $d$.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 2.1
  • proof