Adaptive Semiparametric Language Models

Dani Yogatama, Cyprien de Masson d'Autume, Lingpeng Kong

TL;DR

The paper introduces SPALM, a semiparametric language model that combines a Transformer with both short-term working memory and long-term episodic memory retrieved via $k$-NN search. A context-dependent gate blends local, extended, and global information to predict the next token, enabling adaptive memory usage across contexts. Empirical results on WikiText-103, WMT, and enwik8 show SPALM beating strong baselines, with notable gains from integrating long-term memory and reduced reliance on fixed interpolation. The work demonstrates the viability of architectural memory integration for language modeling and points toward extensible, multi-modal memory frameworks.

Abstract

We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states -- similar to transformer-XL -- and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
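The gating idea in the abstract can be illustrated with a small sketch. The abstract only states that a gating function adaptively combines the information sources; the sigmoid form, the per-dimension gate, and all names below (`gated_prediction`, `w_gate`, `W_out`) are assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - np.max(x)          # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def gated_prediction(h, m, w_gate, W_out):
    """Blend two information sources with a context-dependent gate.

    h      : (d,) hidden state from local context + short-term memory
    m      : (d,) vector summarizing retrieved long-term memory
    w_gate : (d, d) hypothetical gate parameters (assumption)
    W_out  : (d, V) hypothetical output projection (assumption)
    """
    g = sigmoid(w_gate @ h)            # gate depends on the current context
    z = g * h + (1.0 - g) * m          # g near 1 favors h, g near 0 favors m
    return softmax(z @ W_out), g

rng = np.random.default_rng(0)
d, V = 4, 6
probs, gate = gated_prediction(
    rng.normal(size=d), rng.normal(size=d),
    rng.normal(size=(d, d)), rng.normal(size=(d, V)),
)
print(probs.sum(), gate)
```

Because the gate is computed from the hidden state, the mixture weights change from token to token, in contrast to a fixed interpolation coefficient shared across all positions.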

Paper Structure

This paper contains 24 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Our language model architecture has three main components: (i) a transformer that processes the current local context, (ii) a short-term memory module that stores hidden states from an extended context, and (iii) a key-value (hidden state-output token) database that stores compressed long-term context. At each timestep, our model combines the current context and short-term memory with a mechanism similar to transformer-XL. It then retrieves from the long-term memory module a set of past output tokens that were used in similar contexts. These past output tokens are then encoded and aggregated into a single vector that represents long-term information. We use a context-dependent gate to combine information from multiple sources for making a final prediction.
  • Figure 2: A sequence of words from WMT and its four nearest neighbors at each position. We break down the sequence into four blocks. The bottom row of each block in blue represents the original sequence, which is "Elizabeth Warren on Friday ... the middle class". Each row above it represents a nearest neighbor token (starting from the first neighbor in the second row from the bottom to the fourth neighbor at the top) that is used when predicting that particular word. We highlight matching neighbor--target words in green. We provide a more detailed discussion in §\ref{sec:neighborexamples}.
  • Figure 3: A sequence of characters from enwik8 and its two nearest neighbors at each position. We break down the sequence into two blocks. The bottom row of each block in blue represents the original character sequence, which is "Even before ... [[1979]]". The two rows above it represent the nearest neighbors (the first nearest neighbors in the second row from the bottom and the second nearest neighbors in the top row) that are used when predicting that particular character. We highlight matching neighbor--target characters in green. We provide a more detailed discussion in §\ref{sec:neighborexamples}.
  • Figure 4: Three example sequences from the WMT test set. We highlight words where both $p_{\text{TXL}}$ and $p_{\textsc{Spalm}}$ are larger than $p_{\text{transformer}} + 0.1$ in green and $p_{\textsc{Spalm}} > p_{\text{TXL}} + 0.1$ in blue. See §\ref{sec:outputanalysis} for details.
  • Figure 5: Distributions of values of $\mathbf{z}$ for WMT (left) and enwik8 (right) development sets.
  • ...and 2 more figures
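
The long-term memory lookup described in the Figure 1 caption (retrieve past output tokens from a key-value database, then encode and aggregate them into one vector) can be sketched roughly as follows. This is a toy illustration under stated assumptions: brute-force squared-L2 search stands in for whatever nearest-neighbor index is actually used, the attention-style aggregation is one plausible choice rather than the confirmed one, and `retrieve_neighbors`, `aggregate`, and `embed` are hypothetical names.

```python
import numpy as np

def retrieve_neighbors(query, keys, values, k):
    """Return the output tokens stored with the k keys closest to `query`.

    keys   : (N, d) stored hidden states
    values : (N,)   token id that followed each stored hidden state
    """
    dists = np.sum((keys - query) ** 2, axis=1)   # brute-force squared L2 (assumption)
    nearest = np.argsort(dists)[:k]               # indices sorted by increasing distance
    return values[nearest]

def aggregate(query, neighbor_tokens, embed):
    """Encode neighbor tokens and pool them into a single (d,) vector.

    Uses softmax attention with `query` as the attention query (assumption).
    embed : (V, d) hypothetical token embedding table
    """
    E = embed[neighbor_tokens]                    # (k, d) encoded neighbors
    scores = E @ query
    scores = scores - scores.max()                # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ E

# Toy database with three entries in a 2-d hidden space.
keys = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
values = np.array([7, 8, 9])
query = np.array([0.9, 0.9])

tokens = retrieve_neighbors(query, keys, values, k=2)
memory_vector = aggregate(query, tokens, np.random.default_rng(0).normal(size=(10, 2)))
print(tokens, memory_vector.shape)
```

The resulting vector plays the role of the "long-term information" that the context-dependent gate combines with the transformer's hidden state.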