TurboAttention: Efficient Attention Approximation For High Throughputs LLMs

Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan

TL;DR

TurboAttention tackles the attention bottleneck in large language model inference by merging quantized execution with KV-cache compression in a FlashAttention-compatible flow. It introduces FlashQ for headwise mixed-precision progressive quantization and SAS for a tensor-core–friendly softmax, along with an enhanced KV cache buffer to support long-context decoding. The approach yields up to 1.8x latency reduction in prefill, up to 1.7x in decoding, and up to 2.37x maximum throughput over FP16 baselines, while maintaining near-lossless accuracy across multiple models and tasks and reducing KV-cache footprint by more than 4.4x. This work demonstrates that a unified, quantized attention with cooperative KV-cache handling can substantially improve throughput and memory efficiency in real-world LLM inference scenarios.

Abstract

Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanism. While techniques such as quantization and acceleration algorithms such as FlashAttention have improved the efficiency of overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves a 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.
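
The abstract describes SAS only at a high level, and the Figure 5 caption mentions a polynomial fit for the decimal part of the exponent. The NumPy sketch below illustrates one way such an approximation could look; it is not the authors' kernel. The function name `sas_like_softmax`, the cutoff `THRESHOLD = 6`, the cubic degree, and the use of `np.polyfit` are all assumptions made for illustration. The idea shown: write each shifted score as an integer part plus a fractional part, take the integer part from a small lookup table, approximate the fractional part with a low-degree polynomial, and zero out scores far below the row maximum.

```python
import numpy as np

THRESHOLD = 6  # hypothetical cutoff: scores more than 6 below the row max become 0

# Fit a cubic to exp(-f) on [0, 1]; degree and sample count are illustrative choices.
_f = np.linspace(0.0, 1.0, 256)
POLY = np.polyfit(_f, np.exp(-_f), deg=3)

# Lookup table for the integer part: exp(-0), exp(-1), ..., exp(-THRESHOLD).
LUT = np.exp(-np.arange(THRESHOLD + 1))


def sas_like_softmax(scores):
    """Approximate softmax along the last axis (illustrative sketch only)."""
    # t >= 0 is the distance of each score below its row maximum.
    t = scores.max(axis=-1, keepdims=True) - scores
    keep = t <= THRESHOLD                       # sparsity mask
    t_kept = np.where(keep, t, 0.0)
    n = np.floor(t_kept).astype(np.int64)       # integer part, indexes the LUT
    f = t_kept - n                              # fractional part in [0, 1)
    e = LUT[n] * np.polyval(POLY, f)            # exp(-t) ~= exp(-n) * poly(f)
    e = np.where(keep, e, 0.0)                  # drop negligible entries
    return e / e.sum(axis=-1, keepdims=True)
```

This sketch runs in double precision on the CPU, so it only illustrates the arithmetic decomposition, not the reduced-precision tensor-core execution the paper targets.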

Paper Structure

This paper contains 24 sections, 14 equations, 10 figures, 5 tables, 3 algorithms.

Figures (10)

  • Figure 1: Latency profile of Phi3-Medium on an Nvidia A100 GPU. (a) KV cache compression techniques impose a dequantization overhead on attention kernel latency. (b) TurboAttention significantly improves attention kernel latency compared to the FlashAttention (FP16) baseline, while other work mainly focuses on reducing only the KV-cache memory footprint and bandwidth. (c) TurboAttention reduces the latency of Matmul + KV-cache load by enabling quantized integer inference, of dequantization by applying block progressive quantization, and of softmax by introducing sparse activated softmax.
  • Figure 2: High-level comparison of TurboAttention with a state-of-the-art KV-cache compression technique combined with FlashAttention. TurboAttention accelerates the attention mechanism by adopting (1) FlashQ, which enables KV cache compression and accelerated Matmuls ((2) and (3)), and (4) SAS, which allows faster execution of attention by using the tensor cores of GPUs efficiently.
  • Figure 3: Dataflow of TurboAttention in pre-fill and decode. At pre-fill (left), we first compress QKV block-wise into INT8 (Step 1) and compute the attention matrix online using SAS (see Section 4, Step 2). Next, we compress the INT8 KV blocks channel-wise into asymmetric INT4/INT2 using integer arithmetic; these are stored in the cache (Step 3). At decoding (right), we first compress the generated QKV to INT8 (Step 1) and decompress the KV cache to INT8 for integer inference (Step 2). Again, we use SAS to compute attention (Step 3).
  • Figure 4: Min-max distribution of the query, key, and value channels of the Phi3-mini and LLaMA3-8B models. We observe that certain heads in query and key have a number of large-magnitude channels. For value, there is no obvious outlier pattern.
  • Figure 5: Polynomial fit for the decimal part of the value in the exponentiation operation.
  • ...and 5 more figures
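
The two-stage compression described in the Figure 3 caption (block-wise INT8 first, then channel-wise asymmetric INT4 in integer arithmetic) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names, the per-block symmetric scale, the integer step size `step`, and the rounding scheme are all hypothetical choices, and real kernels would pack two INT4 values per byte.

```python
import numpy as np


def quantize_int8_symmetric(block):
    """Stage 1 (assumed form): symmetric INT8 with one scale per block."""
    scale = float(np.abs(block).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale works
    q8 = np.clip(np.rint(block / scale), -127, 127).astype(np.int8)
    return q8, scale


def compress_int8_to_int4(q8):
    """Stage 2 (assumed form): asymmetric 4-bit codes per channel, integers only."""
    q = q8.astype(np.int32)                      # widen to avoid int8 overflow
    zero = q.min(axis=0, keepdims=True)          # per-channel zero point
    span = q.max(axis=0, keepdims=True) - zero   # per-channel range
    step = np.maximum(1, (span + 14) // 15)      # integer ceil(span / 15)
    # Round-to-nearest using integer arithmetic only.
    q4 = np.clip((q - zero + step // 2) // step, 0, 15).astype(np.uint8)
    return q4, step, zero


def decompress_int4_to_int8(q4, step, zero):
    """Recover INT8 values from the 4-bit codes without floating point."""
    q = q4.astype(np.int32) * step + zero
    return np.clip(q, -127, 127).astype(np.int8)


# Usage: rows are tokens, columns are channels (an assumed layout).
rng = np.random.default_rng(0)
kv_block = rng.standard_normal((64, 8)).astype(np.float32)
q8, scale = quantize_int8_symmetric(kv_block)
q4, step, zero = compress_int8_to_int4(q8)
q8_restored = decompress_int4_to_int8(q4, step, zero)
```

Because the second stage uses an integer step and zero point, the cached codes can be expanded back to INT8 with one multiply and one add per element, which is the property the caption points to when it says the KV cache is decompressed to INT8 for integer inference.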