Ultra-Sparse Memory Network
Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou
TL;DR
UltraMem addresses the energy and latency costs of inference in large transformers by introducing ultra-sparse memory layers that extend Product-Key Memory (PKM). It combines Tucker Decomposition-based query-key retrieval (TDQKR), Implicit Value Expansion (IVE), and Multi-Core Scoring (MCS) within a Pre-LayerNorm Transformer so that the design can scale toward billions of memory slots while keeping memory access minimal. Empirically, UltraMem outperforms MoE and PKM at the same parameter and compute budgets and shows favorable scaling behavior, achieving up to 6× faster inference in practical batch regimes and matching much larger dense models at lower cost. The results suggest UltraMem is a scalable, efficient route to training and deploying massive sparse-memory language models in resource-constrained settings.
Abstract
It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms MoE. In our experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget, paving the way for billions of slots or experts.
