LLM Inference Acceleration via Efficient Operation Fusion
Mahsa Salmani, Ilya Soloveychik
TL;DR
The paper tackles latency in Transformer-based LLM inference caused by collective operations needed for Softmax and Layernorm denominators. It proposes an algebraically equivalent operation-fusion framework that defers normalization until after the subsequent matrix multiplication, exploiting commutativity between linear and non-linear steps to overlap computations on separate hardware engines (DIMC and SIMD). The methodology decomposes Layernorm and Softmax into element-wise and collective sub-operations, with explicit fused formulations such as $yF = \frac{1}{\sqrt{\sigma^2+\epsilon}} ( x (I - \frac{1}{n} E) \boldsymbol{\Gamma} ) F + \boldsymbol{\beta}F$ for Layernorm and $y\mathbf{V} = \frac{1}{\sum_i e^{x_i}} [e^{x_1},\ldots,e^{x_n}]\mathbf{V}$ for Softmax, enabling simultaneous execution of the linear layer and the normalization denominator. Implemented on the Corsair AI accelerator, the approach yields around a 20% reduction in inference latency for models like Llama2 and Llama3 while maintaining exact numerical accuracy. This work demonstrates a practical hardware-aware strategy to accelerate large-scale Transformer inference without compromising model performance, motivating further hardware-software co-design extensions to additional components.
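The fused Layernorm formulation above can be checked numerically with a small sketch. This is an illustration of the algebra only, not the authors' implementation or anything specific to Corsair. It assumes that $E$ is the $n \times n$ all-ones matrix (so $x(I - \frac{1}{n}E)$ subtracts the mean), that $\boldsymbol{\Gamma}$ is a diagonal matrix of Layernorm gains, and that $(I - \frac{1}{n}E)\boldsymbol{\Gamma}F$ and $\boldsymbol{\beta}F$ can be precomputed offline. The function names `fold_weights`, `fused_layernorm_linear`, and the other helpers are hypothetical.

```python
import math

def vecmat(v, M):
    """Row vector (length n) times an n x m matrix stored as a list of rows."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def layernorm_then_linear(x, gamma, beta, F, eps=1e-5):
    """Reference path: normalize first, then apply the linear layer F."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    y = [(v - mean) / math.sqrt(var + eps) * g + b
         for v, g, b in zip(x, gamma, beta)]
    return vecmat(y, F)

def fold_weights(gamma, F):
    """Precompute W = (I - E/n) * Gamma * F, assuming E is the all-ones matrix."""
    n, m = len(F), len(F[0])
    GF = [[gamma[i] * F[i][j] for j in range(m)] for i in range(n)]
    # Left-multiplying by (I - E/n) subtracts each column's mean from that column.
    col_mean = [sum(GF[i][j] for i in range(n)) / n for j in range(m)]
    return [[GF[i][j] - col_mean[j] for j in range(m)] for i in range(n)]

def fused_layernorm_linear(x, W, beta_F, eps=1e-5):
    """Fused path: the matmul and the denominator do not depend on each other."""
    n = len(x)
    unscaled = vecmat(x, W)                     # work for the matrix engine
    mean = sum(x) / n                           # collective reduction, which could
    var = sum((v - mean) ** 2 for v in x) / n   # run concurrently with the matmul
    scale = 1.0 / math.sqrt(var + eps)
    return [scale * u + b for u, b in zip(unscaled, beta_F)]

# Small illustrative inputs (arbitrary values).
x = [0.5, -1.0, 2.0, 3.5]
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.1, -0.2, 0.3, 0.0]
F = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0]]

reference = layernorm_then_linear(x, gamma, beta, F)
fused = fused_layernorm_linear(x, fold_weights(gamma, F), vecmat(beta, F))
```

In this sketch the two paths run sequentially, so it shows only that the results agree (up to floating-point rounding), not the latency benefit, which depends on the matrix multiplication and the reduction running on separate engines.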
Abstract
The rapid development of Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations that involve normalization. For instance, each decoder block typically contains at least one Softmax operation and two Layernorms. The computation of the corresponding normalization scaling factors becomes a major bottleneck because it requires spatial collective operations. In other words, to compute the denominators for Softmax and Layernorm, all vector elements must be aggregated into a single location, which requires significant communication. These collective operations slow down inference on Transformers by approximately 20%, defeating the whole purpose of distributed in-memory compute. In this work, we propose an extremely efficient technique that can completely hide the overhead caused by such collective operations. Note that each Softmax and Layernorm operation is typically followed by a linear layer. Since non-linear and linear operations are performed on different hardware engines, they can be easily parallelized once the algebra allows such commutation. By leveraging the inherent properties of linear operations, we can defer the normalization of the preceding Softmax and Layernorm until after the linear layer is computed. This allows us to compute the collective scaling factors concurrently with the matrix multiplication and completely hide the latency of the former behind the latter. Such parallelization preserves the numerical accuracy while significantly improving the hardware utilization and reducing the overall latency.
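The deferral described in the abstract is easiest to see for Softmax: because the denominator is a scalar, it commutes with the subsequent matrix multiplication. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The names `softmax_then_matmul` and `deferred_softmax_matmul` are hypothetical, and the usual max-subtraction for numerical stability is omitted to keep the algebra visible.

```python
import math

def vecmat(v, M):
    """Row vector (length n) times an n x m matrix stored as a list of rows."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def softmax_then_matmul(x, V):
    """Reference path: normalize first, then multiply by V."""
    e = [math.exp(v) for v in x]
    denom = sum(e)                      # collective reduction blocks the matmul
    return vecmat([ei / denom for ei in e], V)

def deferred_softmax_matmul(x, V):
    """Deferred path: multiply the unnormalized exponentials, scale afterwards."""
    e = [math.exp(v) for v in x]        # element-wise step
    unnormalized = vecmat(e, V)         # matmul on the unnormalized vector
    denom = sum(e)                      # independent of the matmul, so the two
                                        # could overlap on separate engines
    return [u / denom for u in unnormalized]

# Small illustrative inputs (arbitrary values).
scores = [0.2, -1.3, 0.7]
V = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.5]]

reference = softmax_then_matmul(scores, V)
deferred = deferred_softmax_matmul(scores, V)
```

Both functions return the same vector up to floating-point rounding. The only structural difference is that in the deferred version the reduction `sum(e)` no longer sits on the critical path before the matrix multiplication.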
