QKV Projections Require a Fraction of Their Memory

Malik Khalaf; Yara Shamshoum; Nitzan Hodos; Yuval Sieradzki; Assaf Schuster

TL;DR

We propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that compresses the activations of the $Q$, $K$, $V$ projections in attention layers by a factor of up to $\times 512$, effectively erasing their memory footprint, while achieving similar or better final perplexity.

Abstract

The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that compresses the activations of the $Q,K,V$ projections in attention layers by a factor of up to $\times 512$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.
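
To give a sense of scale for the compression factor quoted above, the following back-of-the-envelope sketch estimates how much memory the saved input $x$ of the $Q$, $K$, $V$ projections occupies during training, and what remains after a $\times 512$ reduction. It is not an implementation of PAMM. The function name and the model dimensions are hypothetical, chosen only for illustration, and the sketch assumes one 16-bit copy of $x$ is kept per layer for the backward pass and ignores any bookkeeping the compression method itself may store.

```python
def qkv_input_activation_bytes(batch, seq_len, hidden, n_layers, bytes_per_elem=2):
    """Bytes needed to keep the projection input x for the backward pass.

    Assumption: Q, K, and V are all computed from the same x, so a single
    copy of shape (batch, seq_len, hidden) is counted per layer.
    """
    return batch * seq_len * hidden * n_layers * bytes_per_elem


# Hypothetical configuration (not taken from the paper).
baseline = qkv_input_activation_bytes(batch=8, seq_len=2048, hidden=4096, n_layers=32)

# Upper-end compression ratio quoted in the abstract.
compressed = baseline // 512

print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**20:.2f} MiB")
```

Under these assumed dimensions the saved inputs take 4 GiB before compression and 8 MiB after, which illustrates why the abstract describes the footprint as effectively erased.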
