MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models

Nathanaël Carraz Rakotonirina, Marco Baroni

TL;DR

MemoryPrompt addresses the limitation of fixed context windows in transformer LMs by introducing a lightweight external memory module that prefixes the input with continuous memory vectors $\mathbf{P} \in \mathbb{R}^{m\times e}$, where $m=5$ and $e$ is the embedding size. The vectors are computed by a small MLP-LSTM, and the base LM stays frozen, so long-range context tracking is obtained without finetuning the LM. Empirically, MemoryPrompt outperforms larger full-context models on a fact-updating task and matches full-history performance on MSC, while showing substantially less catastrophic forgetting than finetuned memory baselines. The work suggests a practical path to adapting pre-trained LMs to evolving information with minimal retraining, enabling efficient deployment on smaller hardware.

Abstract

Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe a LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history. We also test MemoryPrompt on a long-distance dialogue dataset, where its performance is comparable to that of a model conditioned on the entire conversation history. In both experiments we also observe that, unlike full-finetuning approaches, MemoryPrompt does not suffer from catastrophic forgetting when adapted to new tasks, thus not disrupting the generalist capabilities of the underlying LM.
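
The prefixing mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` and `toy_memory_update` are hypothetical stand-ins for the frozen LM's embedding lookup and for the MLP-LSTM memory module, the embedding size `E = 8` is an arbitrary toy value, and the update reads the segment embeddings directly only to keep the example short. The only details taken from the text are that $m=5$ memory vectors of size $e$ are carried from one segment to the next and prepended to that segment's input embeddings.

```python
import random

M = 5  # number of memory vectors, as stated in the TL;DR
E = 8  # toy embedding size (assumption; in practice this is the LM's embedding size)


def toy_embed(token, e=E):
    """Hypothetical stand-in for the frozen LM's embedding lookup."""
    rng = random.Random(token)  # string seeds give reproducible vectors
    return [rng.uniform(-1.0, 1.0) for _ in range(e)]


def toy_memory_update(memory, segment_embeddings):
    """Hypothetical stand-in for the MLP-LSTM: blend old memory with the segment mean."""
    e = len(memory[0])
    n = len(segment_embeddings)
    mean = [sum(vec[i] for vec in segment_embeddings) / n for i in range(e)]
    return [[0.5 * old[i] + 0.5 * mean[i] for i in range(e)] for old in memory]


def run_segments(segments, m=M, e=E):
    """Return the prefixed input for every segment plus the final memory."""
    memory = [[0.0] * e for _ in range(m)]  # P, shape (m, e), empty at the start
    lm_inputs = []
    for segment in segments:
        embeddings = [toy_embed(tok, e) for tok in segment]
        # Prefix the segment's embeddings with the current memory vectors.
        lm_inputs.append(memory + embeddings)
        # The frozen LM would process the prefixed input here; this toy update
        # skips that step and reads the segment embeddings directly (assumption).
        memory = toy_memory_update(memory, embeddings)
    return lm_inputs, memory


if __name__ == "__main__":
    inputs, final_memory = run_segments([["a", "b", "c"], ["d", "e"]])
    for i, prefixed in enumerate(inputs):
        print(f"segment {i}: {len(prefixed)} input vectors of size {len(prefixed[0])}")
```

Each prefixed input has `m` more rows than its segment has tokens, and only the memory module's parameters would need training in such a setup, since the LM itself is never updated.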

Paper Structure (18 sections, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: Unfolded graph of MemoryPrompt at training time. The input is divided into segments and, for each segment, the augmented system produces both the LM output and the memory vectors (blue) which are concatenated to the embeddings of the next segment.
  • Figure 2: Sequence from the short-fd dataset (see the dataset configuration table). Non-highlighted text contains stable facts. The pivot (in blue) is the mutable fact to track, and the final answer is the most up-to-date object of the pivot. Distractors (in orange) are mutable facts distinct from the pivot, which might belong to the same relation and might be updated (here, the object associated with Guido Pepoli is updated from cardinal to bishop).
  • Figure 3: Accuracy of OPT-1.3B full-context (top) and OPT-125M + MemoryPrompt (bottom) as a function of the number of updates on the long many-updates (mu) fact-updating dataset.
  • Figure 4: Cosine similarity between one of the 5 memory vectors and the embeddings of the objects in a sequence. Each panel represents a sequence, with objects of different facts on the x-axis. The pivot objects are in bold. The subject/relation of the pivots are Louis Charles Delescluze/work location (left) and Rao Remala/employer (right).
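
The comparison behind Figure 4 rests on cosine similarity between a memory vector and object embeddings. A minimal sketch of that computation follows; the helper names `cosine_similarity` and `rank_objects`, the object labels, and the three-dimensional vectors are all invented for illustration and are not values from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_objects(memory_vector, object_embeddings):
    """Sort object names by similarity to the memory vector, most similar first."""
    scores = {
        name: cosine_similarity(memory_vector, emb)
        for name, emb in object_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors, made up for this example only.
    memory_vector = [1.0, 0.5, 0.0]
    object_embeddings = {
        "pivot_object": [2.0, 1.0, 0.0],
        "distractor_object": [0.0, 0.0, 3.0],
    }
    for name, score in rank_objects(memory_vector, object_embeddings):
        print(f"{name}: {score:.3f}")
```

Because cosine similarity ignores vector length, an object embedding pointing in the same direction as the memory vector scores highest regardless of its norm.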