An Evolved Universal Transformer Memory

Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang

TL;DR

The paper addresses the escalating costs and limited context windows of modern transformers by introducing Neural Attention Memory Models (NAMMs), a lightweight, evolution-optimized memory-management framework that prunes the KV cache based on attention-derived features. NAMMs condition only on attention matrices and produce a per-token score that decides which KV cache entries to evict. Their features come from a short-time Fourier transform (STFT) spectrogram of each token's attention values, and a backward attention memory (BAM) layer with a backward mask shares information across tokens, which keeps the design model-agnostic. Trained with CMA-ES through incremental evolution on a context-extended Llama 3 8B base model, NAMMs improve long-context performance across multiple benchmarks while reducing KV cache usage, and they transfer zero-shot to unseen architectures and modalities, including vision and reinforcement learning. The approach shows that memory management can be learned independently of gradient-based training, enabling efficient long-context reasoning with broad applicability and potential for future extensions across tasks and modalities.
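The feature pipeline summarized above (attention values, then an STFT spectrogram, then a reduction over time, then a per-token score) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Hann window, the window and stride sizes, the EMA coefficient `gamma`, the linear scorer standing in for the BAM network, and the rule of evicting tokens with a negative score are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def stft_magnitude(signal, window=8, stride=4):
    """Magnitude spectrogram of a 1-D signal, shape (frames, window // 2 + 1).
    Window type and sizes are illustrative assumptions."""
    hann = np.hanning(window)
    starts = range(0, len(signal) - window + 1, stride)
    frames = np.stack([signal[s:s + window] * hann for s in starts])
    return np.abs(np.fft.rfft(frames, axis=-1))

def ema_reduce(spec, gamma=0.9):
    """Collapse the time axis with an element-wise exponentially weighted sum,
    so that more recent frames contribute more (gamma is an assumption)."""
    state = np.zeros(spec.shape[-1])
    for frame in spec:
        state = gamma * state + frame
    return state

def keep_mask(attn, weights, bias=0.0):
    """attn: (num_queries, num_kv_tokens) attention values for one head.
    Returns a boolean mask over KV tokens; False means 'evict'.
    A linear scorer stands in for the learned memory network (assumption)."""
    feats = np.stack([
        ema_reduce(stft_magnitude(attn[:, j])) for j in range(attn.shape[1])
    ])
    scores = feats @ weights + bias
    return scores >= 0  # assumed rule: drop tokens with negative score

# Hypothetical usage: 16 recent queries attending to 5 cached tokens.
rng = np.random.default_rng(0)
attn = rng.random((16, 5))
weights = rng.normal(size=5)  # 5 = window // 2 + 1 frequency bins
print(keep_mask(attn, weights))
```

Because the scorer sees only attention values, the same function applies to any head of any self-attention model, which is the property the summary calls model-agnostic.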

Abstract

Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts up to a fraction of the original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.

Paper Structure

This paper contains 35 sections, 6 equations, 16 figures, 22 tables, 2 algorithms.

Figures (16)

  • Figure 1: NAMMs use evolution to optimize the performance of LMs by pruning their KV cache memory. Evolved NAMMs can be zero-shot transferred to other transformers, even across input modalities and task domains.
  • Figure 2: Schematic depiction of our Neural Attention Memory Model design. We extract features from a spectrogram over the attention values of the KV cache tokens (left), which we reduce via an element-wise exponential moving average (EMA) operation (center). These features are fed to our memory model's networks with fully connected (FC) and cross-token BAM connections (right).
  • Figure 3: Our backward mask makes each token attend exclusively to its future relatives in the KV cache.
  • Figure 4: Mean and standard deviation over the CMA-ES population batch performance (left), together with the performance of the learned mean parameter on each task (right).
  • Figure 5: Comparing NAMM with H2O and L2 while varying the cache size.
  • ...and 11 more figures
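
The backward mask described in the Figure 3 caption can be sketched as a counter-causal attention mask: row `i` may only look at positions at or after `i`, the mirror image of the usual causal mask. The sketch below is a hedged illustration rather than the paper's code. In particular, allowing each token to attend to itself (`include_self=True`) is an assumption made here so that the newest token's row is never empty, and the function names are hypothetical.

```python
import numpy as np

def backward_mask(n, include_self=True):
    """Boolean (n, n) mask where entry [i, j] is True if token i may attend
    to token j. Only later tokens (j > i) are visible, plus the token itself
    when include_self is True (an assumption for numerical convenience)."""
    offset = 0 if include_self else 1
    return np.triu(np.ones((n, n), dtype=bool), k=offset)

def backward_masked_attention(q, k, v):
    """Single-head scaled dot-product attention under the backward mask.
    q, k, v: arrays of shape (n, d). Returns (output, attention_weights)."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    logits = np.where(backward_mask(n), logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical usage: 4 cached tokens with 3-dimensional features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
out, w = backward_masked_attention(x, x, x)
print(np.round(w, 2))  # strictly lower triangle is zero
```

Under this mask an older token can compare itself against newer ones, for example to detect that a later token makes it redundant, while the newest token is scored without reference to older entries.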