TokenButler: Token Importance is Predictable

Yash Akhauri; Ahmed F AbouElhamayed; Yifei Gao; Chi-Chih Chang; Nilesh Jain; Mohamed S. Abdelfattah

TL;DR

TokenButler tackles the KV-Cache memory bottleneck in long-context LLMs by introducing a lightweight, per-head token-importance predictor that approximates full attention logits at decode time. Trained with a mean-squared-error objective against the true pre-softmax logits while the LLM stays frozen, it delivers fine-grained, query-aware token prioritization with negligible latency overhead (under 2%) and parameter cost (under 1.2% of the LLM). Empirical results show improvements of over 8% in perplexity and downstream accuracy relative to state-of-the-art token-importance methods, along with strong performance on a synthetic co-reference benchmark, standard benchmarks, and reasoning models. The work demonstrates that per-head, high-granularity token selection can preserve essential context in co-reference and complex reasoning tasks while reducing memory-bandwidth demands, which points toward practical decode-time token-loading strategies guided by learned importance.
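To make the training objective concrete, here is a minimal single-head sketch of a mean-squared-error loss against true pre-softmax logits. The helper names (`true_logits`, `mse_loss`), the toy vectors, and the hard-coded `predicted` values are all illustrative assumptions; the predictor architecture itself is not described in this summary and is not modeled here.

```python
import math

def true_logits(query, keys):
    """Pre-softmax attention logits for one head: (q . k) / sqrt(d)."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

def mse_loss(predicted, target):
    """Mean-squared error between predicted and true logits."""
    assert len(predicted) == len(target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Toy example: one decode-time query and three cached keys (d = 4).
query = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 0.0],
]

# Regression target produced by the frozen LLM's attention head.
target = true_logits(query, keys)

# Hypothetical predictor output; a real predictor would compute this
# from its own lightweight parameters.
predicted = [0.9, 0.1, 1.2]

loss = mse_loss(predicted, target)
print(f"target logits: {target}, MSE loss: {loss:.4f}")
```

In a real setup, only the predictor's parameters would receive gradients from this loss, since the LLM is frozen during training.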

Abstract

Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. There is, however, an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contributes meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation time, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a lightweight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: https://github.com/abdelfattah-lab/TokenButler
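The idea that only a few tokens matter at each decoding step can be illustrated with a short sketch of per-head, token-level selection. It assumes a simple top-k rule over predicted importance scores; the abstract does not specify the selection rule, and the function names, head labels, and toy numbers below are hypothetical.

```python
import math

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(logits, values, keep):
    """Attend only over the tokens in `keep`; all others are never loaded."""
    weights = softmax([logits[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, keep))
            for j in range(dim)]

# Hypothetical predicted importance for five cached tokens, per head.
predicted_importance = {
    "head_0": [0.1, 4.0, 0.2, 3.5, 0.0],
    "head_1": [2.9, 0.3, 0.1, 0.2, 3.1],
}

# Each head selects its own tokens (fine-grained, not page-level).
selected = {h: top_k_indices(s, k=2) for h, s in predicted_importance.items()}

# Toy true logits and value vectors for head_0.
logits_head_0 = [0.0, 5.0, 0.1, 5.0, -1.0]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
output_head_0 = sparse_attention(logits_head_0, values, selected["head_0"])
print(selected, output_head_0)
```

Because each head keeps its own token set, two heads can attend to disjoint parts of the context, which page-level retrieval cannot express at the same granularity.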
