On Retrieval Augmentation and the Limitations of Language Model Training

Ting-Rui Chiang, Xinyan Velocity Yu, Joshua Robinson, Ollie Liu, Isabelle Lee, Dani Yogatama

TL;DR

The paper investigates why $k$NN-LM retrieval improves perplexity and finds that the softmax bottleneck in the last layer is not the sole reason. It introduces the Macondo dataset to study generalization from over-specified training data and shows that scaling alone (e.g., GPT-3.5-turbo) does not close this gap. It demonstrates that an MLP mapping datastore keys to values can partly replicate the benefits of $k$NN retrieval at far lower storage cost, offering a scalable alternative. The findings suggest that improving LM generalization under over-specification and exploring retrieval-inspired surrogates could yield practical gains beyond simply increasing model size or data.

Abstract

Augmenting a language model (LM) with $k$-nearest neighbors ($k$NN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited possibility -- the "softmax bottleneck." We then create a new dataset to evaluate LM generalization ability in the setting where training data contains additional information that is not causally relevant. This task is challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral 7B, $k$NN retrieval augmentation consistently improves performance in this setting. Finally, to make $k$NN retrieval more accessible, we propose using a multi-layer perceptron model that maps datastore keys to values as a drop-in replacement for traditional retrieval. This reduces storage costs by over 25x.
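
As a rough illustration of the retrieval augmentation the abstract refers to, the sketch below follows the commonly used $k$NN-LM recipe: retrieve the nearest datastore keys for the current hidden state, turn their distances into a distribution over the stored next tokens, and interpolate with the LM's own prediction. This is a minimal sketch, not the paper's implementation; the function name `knn_lm_probs`, the parameters `k`, `lam`, and `temperature`, and the toy datastore are all assumptions made for illustration.

```python
import numpy as np

def knn_lm_probs(query, keys, values, p_lm, k=4, lam=0.25, temperature=1.0):
    """Interpolate LM next-token probabilities with a kNN distribution.

    query:  (d,) hidden state for the current context
    keys:   (n, d) datastore keys (hidden states seen in training)
    values: (n,) token ids that followed each key
    p_lm:   (vocab,) next-token distribution from the LM
    k, lam, temperature: hypothetical hyperparameters for this sketch
    """
    # Squared L2 distance from the query to every datastore key.
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    # Indices of the k nearest keys (unordered among themselves).
    nn = np.argpartition(dists, k - 1)[:k]
    # Softmax over negative distances gives the neighbor weights.
    logits = -dists[nn] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Accumulate neighbor weights onto the tokens they point to.
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nn], weights)
    # Linear interpolation between the kNN and LM distributions.
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy datastore: 100 random keys over a 10-token vocabulary.
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 8))
values = rng.integers(0, 10, size=100)
p_lm = np.full(10, 0.1)  # uniform LM prediction, for illustration only

# Querying with a stored key shifts probability toward its stored token.
probs = knn_lm_probs(keys[3], keys, values, p_lm)
```

In this toy setup the query coincides with a stored key, so the token stored with that key receives more probability than the uniform LM alone assigns it. The MLP replacement proposed in the abstract would stand in for the nearest-neighbor lookup, which is what removes the need to store the full set of keys.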

Paper Structure (28 sections, 4 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Test log-likelihood of children names in our synthetic dataset Macondo predicted by a fine-tuned GPT-2 XL model for parents with 1-5 children (average of 5 random seeds). The dotted lines represent the results of the $k$NN-augmented LM. The horizontal lines represent the theoretically best log-likelihood a perfect model can achieve ($\log(1/ \text{\# of children})$). See Table \ref{table:macondo-gpt2-xl} for the exact statistics shown in this figure.
  • Figure 2: GPT-3.5-turbo fine-tuned on Macondo-Conv using the OpenAI API. The results are the average of 5 runs on 5 datasets generated with 5 random seeds. Note that the presented loss involves special tokens, e.g., end-of-string tokens, so the theoretical perfect likelihood is greater than $\log 0.5$. The gray line is the test loss we achieve when we use the test data to train the model.
  • Figure 3: Log-likelihood of children names in our synthetic dataset Macondo predicted by a fine-tuned GPT-2/Mistral-7B-v0.1 model for parents with 1-5 children (average of 5 random seeds). The dotted lines represent the results of the $k$NN-augmented LM. The horizontal lines represent the theoretically best log-likelihood a perfect model can achieve ($\log(1/ \text{\# of children})$).
  • Figure 4: Log likelihood of the children's names in Macondo. The results are the average of 5 random seeds.
  • Figure 5: The training loss on WikiText.
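
The horizontal reference lines in Figures 1 and 3 follow the formula given in the captions, $\log(1/\text{\# of children})$. The short sketch below evaluates it for parents with 1-5 children. It assumes natural logarithms, which the captions do not state, and the helper name `best_log_likelihood` is hypothetical.

```python
import math

def best_log_likelihood(num_children: int) -> float:
    """Ceiling from the figure captions: log(1 / number of children).

    Natural log is an assumption; the captions do not specify the base.
    """
    return math.log(1.0 / num_children)

# Ceilings for parents with 1 to 5 children, as plotted in the figures.
ceilings = {n: best_log_likelihood(n) for n in range(1, 6)}
for n, ll in ceilings.items():
    print(f"{n} children: {ll:.3f}")
```

Under the natural-log assumption the ceiling falls from 0 for a single child to about -1.609 for five children, so a model's log-likelihood should be read against the line for the matching number of children rather than against a single global bound.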