MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

Dong Liu; Yanxuan Yu; Ben Lengerich; Ying Nian Wu

Abstract

As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly and becomes a major bottleneck in both training and inference. Prior work such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduces memory by sharing or compressing KV features, but often trades off representation quality or incurs runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments across a range of sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: perplexity comparable to MLA, with up to 5x higher training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
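The difference between routing across memory levels and fusing them before attention can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the routing projection `w_route`, the softmax gate, and the assumption that all three levels hold position-aligned K/V tensors of equal shape are illustrative choices, and causal masking is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Max-shifted softmax for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (no causal mask in this sketch).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def routing_gates(x, w_route):
    # Hypothetical router: one softmax gate per token over the memory levels.
    return softmax(x @ w_route)                      # shape (T, L)

def routed_attention(q, levels, gates):
    # Routed variant: attend to each level separately, then mix the outputs.
    outs = np.stack([attention(q, k, v) for k, v in levels], axis=1)  # (T, L, d)
    return (gates[:, :, None] * outs).sum(axis=1)

def route_fused_attention(q, levels, gates):
    # Fused variant: mix the levels' K and V first, then run attention once.
    # Assumes self-attention, so the same (T, L) gates index key positions.
    k_fused = sum(gates[:, i, None] * k for i, (k, _) in enumerate(levels))
    v_fused = sum(gates[:, i, None] * v for i, (_, v) in enumerate(levels))
    return attention(q, k_fused, v_fused)
```

Under these assumptions the routed variant runs one attention pass per memory level, while the fused variant runs a single pass over mixed K/V, which is the kind of saving the abstract attributes to fusing memory sources before attention.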

Paper Structure (36 sections, 1 theorem, 10 equations, 2 figures, 11 tables, 3 algorithms)

Introduction
Related Work
Long-context Attention Mechanisms
Hierarchical and External Memory Systems
Dynamic Routing and Query-aware Attention
Motivation: Beyond MLA and MHA
Methodology
Route-Fused MKA (FastMKA): Causal Route-Fusion with KV-Cache
Block-Memory Keyed Attention (Block-MKA) Design
Hierarchical Block-wise MKA Algorithm
Algorithm and Pseudocode
Theoretical Formulation: Recursive MKA with Online Softmax
Recursive Reformulation
Numerical Stability via Max-Shift
Local vs. Global MKA Modes
...and 21 more sections

Key Result

Theorem 5.1

The recursive formulation defined by Equations (6)-(8) computes the gated mixture attention:

Figures (2)

Figure 1: MKA hierarchical memory design with three memory levels (L1: local, L2: session, L3: long-term) and dynamic routing gates.
Figure 2: Hierarchical Memory-Keyed Attention (MKA) with Multi-Level Routing: illustrates the three-tier memory architecture (L1/L2/L3) and query-based routing mechanism.

Theorems & Definitions (1)

Theorem 5.1: Recursive MKA Computes Gated Mixture Attention
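The outline lists a recursive formulation with online softmax and max-shift stabilization. Equations (6)-(8) are not reproduced here, so the following is only a generic sketch of the standard online-softmax recurrence for a single query, not the paper's exact formulation: it streams over K/V blocks while carrying a running maximum, normalizer, and weighted sum, and yields the same result as attention over the concatenated blocks. All names are illustrative.

```python
import numpy as np

def attention_full(q, k, v):
    # Reference: single-query softmax attention over all keys at once.
    s = k @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ v

def attention_online(q, blocks):
    # Streaming version: process (K, V) blocks one at a time.
    m = -np.inf                                # running maximum score
    denom = 0.0                                # running softmax normalizer
    acc = np.zeros(blocks[0][1].shape[1])      # running weighted sum of values
    for k, v in blocks:
        s = k @ q / np.sqrt(q.shape[0])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale old state; 0 on first block
        p = np.exp(s - m_new)                  # max-shifted, so never overflows
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / denom
```

Because every exponent is shifted by the current running maximum, the intermediate terms stay bounded by 1 even when raw scores are large, which is the usual motivation for the max-shift.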
