Ultra-Sparse Memory Network

Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou

TL;DR

UltraMem addresses the energy and latency costs of inference in large Transformers by introducing ultra-sparse memory layers that extend Product-Key Memory (PKM). It combines Tucker Decomposition-based query-key retrieval (TDQKR), Implicit Value Expansion (IVE), and Multi-Core Scoring (MCS) within a Pre-LayerNorm Transformer, with the aim of scaling toward billions of memory slots while keeping memory access minimal. Empirically, UltraMem outperforms MoE and PKM at the same parameter and compute budgets and shows favorable scaling behavior, achieving up to 6× faster inference in practical batch regimes and matching much larger dense models at lower cost. The results suggest UltraMem is a scalable, efficient route to training and deploying massive sparse-memory language models in resource-constrained settings.

Abstract

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms MoE. In experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts.

Paper Structure

This paper contains 18 sections, 11 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: The three models have the same computation, and MoE and UltraMem have the same number of parameters. The x-axis is plotted on a logarithmic scale. In (b) and (c), the sequence length is 1 because only one token can be predicted at a time during decoding, and the key/value cache length is 2048. The experiments in (b) and (c) are conducted on an A100-SXM-80GB.
  • Figure 2: An overview of the multilayer perceptron (MLP) and the large memory layer (LML). For brevity, we omit the third top-$m$ operation from the memory layer. An MLP typically consists of two linear layers and a GeLU activation. We consider the weights of the first linear layer as keys, and those of the second linear layer as values. LML uses row and column keys to determine a 2-D logical address to index memory values, whereas MLP uses a 1-D logical address. "fetch value" refers to retrieving values based on the indices with higher scores.
  • Figure 3: Overview of PKM and UltraMem.
  • Figure 4: Flow of Tucker Decomposed Query-Key Retrieval (TDQKR), here $r=2$. The term "fetch" refers to the action of retrieving scores based on a given index (corresponding to "torch.gather"). TDQKR, which replaces Product Quantization, serves as a more precise retrieval module for recalling value indices in UltraMem. Each step of the TDQKR process is referenced in the main text.
  • Figure 5: Flow of Implicit Value Expansion (IVE), here $E=4$, $m=16$. IVE reduces memory access and scales up memory size by expanding the memory table virtually. Each virtual block is a reparameterization of the physical memory table. Every virtual memory address corresponds to a physical memory address and a projector index. The weighted sum pooling is grouped by the virtual blocks, followed by a linear layer to produce the final output.
  • ...and 6 more figures
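
The 2-D addressing described in the Figure 2 caption can be illustrated with a minimal NumPy sketch of a product-key style lookup. This is not the authors' implementation: the function names (`top_m`, `memory_lookup`), the additive combination of row and column scores, the softmax weighting, and the square `n × n` address grid are all assumptions made for illustration, and TDQKR, IVE, and MCS are not modeled.

```python
import numpy as np


def top_m(scores, m):
    """Return indices and values of the m largest entries, highest first."""
    idx = np.argsort(scores)[::-1][:m]
    return idx, scores[idx]


def memory_lookup(q_row, q_col, row_keys, col_keys, values, m):
    """Hypothetical product-key style lookup over an n x n logical grid.

    q_row, q_col : (d_k,) query halves for the row and column keys
    row_keys     : (n, d_k) row keys
    col_keys     : (n, d_k) column keys
    values       : (n * n, d_v) memory table, addressed as row * n + col
    """
    n = row_keys.shape[0]

    # Score rows and columns independently: 2n dot products, not n * n.
    row_scores = row_keys @ q_row
    col_scores = col_keys @ q_col

    # First two top-m operations: keep the best m rows and m columns.
    r_idx, r_val = top_m(row_scores, m)
    c_idx, c_val = top_m(col_scores, m)

    # Assumption: a cell's score is the sum of its row and column scores.
    grid = r_val[:, None] + c_val[None, :]  # (m, m) candidate cells

    # Third top-m operation: best m cells among the m * m candidates.
    flat_idx, flat_val = top_m(grid.ravel(), m)
    rows = r_idx[flat_idx // m]
    cols = c_idx[flat_idx % m]
    addr = rows * n + cols  # 2-D logical address flattened to a table index

    # "Fetch value": gather the selected rows and pool them.
    # Assumption: softmax over the selected scores as pooling weights.
    w = np.exp(flat_val - flat_val.max())
    w /= w.sum()
    return w @ values[addr], addr
```

With additive scores, the best `m` cells of the full grid always lie in the intersection of the best `m` rows and best `m` columns, so the lookup only touches `m` of the `n * n` value rows while scoring `2n` keys.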