Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Shangyi Geng; Wenting Zhao; Alexander M Rush

TL;DR

This work probes whether the perplexity gains of $k$NN-LMs translate into genuine downstream reasoning capabilities. Using two domain-specific datastores (Wiki and Math) on a broad set of 22 tasks, the authors find that while $k$NN-LMs improve perplexity and help memory-intensive, pattern-based tasks, they often degrade performance on reasoning tasks that require integrating information across sources. Through oracle retrieval experiments and qualitative analyses, they show that even perfect retrieval does not guarantee correct answers, indicating an intrinsic upper bound on reasoning with non-parametric memory. The results caution against relying on perplexity as a proxy for broad LM ability and suggest that improvements in retrieval alone may be insufficient without better integration into reasoning processes; future work could explore training-based retrieval or larger models to close the gap.

Abstract

$K$-nearest neighbor language models ($k$NN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a $k$NN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate $k$NN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that $k$NN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, $k$NN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at https://github.com/GSYfate/knnlm-limits/.
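
For readers unfamiliar with how retrieval is integrated with next-word prediction, the widely used $k$NN-LM formulation can be written as below. This is a sketch of the standard formulation from earlier $k$NN-LM work, not a reproduction of this paper's own equations, which do not appear in this excerpt. The symbols are assumptions introduced here: $f(x)$ is the LM's representation of the context $x$, $\mathcal{N}$ is the set of retrieved key-value pairs $(k_i, v_i)$, $d$ is a distance function, and $\lambda \in [0, 1]$ is the interpolation weight.

```latex
\begin{align}
p_{k\mathrm{NN}}(y \mid x) &\propto \sum_{(k_i, v_i) \in \mathcal{N}} \mathbb{1}[y = v_i] \, \exp\!\big(-d(k_i, f(x))\big), \\
p(y \mid x) &= \lambda \, p_{k\mathrm{NN}}(y \mid x) + (1 - \lambda) \, p_{\mathrm{LM}}(y \mid x).
\end{align}
```

Under this form, retrieval can only shift probability toward tokens that literally follow a retrieved context, which is consistent with the abstract's distinction between recalling patterns and deriving new knowledge.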

Paper Structure (18 sections, 2 equations, 1 figure, 14 tables)

This paper contains 18 sections, 2 equations, 1 figure, 14 tables.

Introduction
Related Work
Retrieval Models
Reasoning Retrieval.
Evaluation of $k$NN-LMs.
$k$-Nearest Neighbor Large Language Models
Experimental Setup.
Inference and Retrieval Models.
$k$NN-LMs Help In-Domain Perplexity
$k$NN-LMs Can Help Memory-Intensive Tasks
$k$NN-LMs Hurt Reasoning Performance
Analysis
Qualitative Analysis.
Is the problem a failure of model weighting?
Is the problem a failure of retrieval?
...and 3 more sections

Figures (1)

Figure 1: In this multi-hop question answering (QA) example, the LM is very uncertain about the next word and could benefit from retrieval. The $k$NN approach finds several documents, some relevant and some irrelevant, that may help. However, two issues occur: first, an irrelevant document increases the probability of a random wrong answer; second, even though a relevant document has been found, it may not upweight the actual answer (Ouse). We study how these issues may impact task performance as compared to perplexity.
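
The failure mode described in the Figure 1 caption can be illustrated with a minimal sketch of $k$NN-LM interpolation. Everything numeric here is hypothetical: the LM probabilities, the retrieved neighbors and their distances, the interpolation weight `lam`, and the distractor tokens (`Thames`, `Severn`, `Trent`) are made up for illustration; only the correct answer, Ouse, comes from the caption. The function names `knn_distribution` and `interpolate` are likewise illustrative and are not taken from the released code.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (next_token, distance) pairs into a distribution.

    Each neighbor adds weight exp(-distance / temperature) to its token,
    and the weights are then normalized to sum to 1.
    """
    weights = defaultdict(float)
    for token, distance in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }


# Hypothetical, nearly flat LM distribution: the LM is uncertain but
# slightly prefers the correct answer.
p_lm = {"Ouse": 0.30, "Thames": 0.25, "Severn": 0.25, "Trent": 0.20}

# Hypothetical retrieved neighbors: none of the retrieved contexts is
# followed by the correct token, so retrieval only upweights wrong answers.
neighbors = [("Thames", 1.0), ("Thames", 1.2), ("Severn", 1.5)]

p_knn = knn_distribution(neighbors)
p_mix = interpolate(p_lm, p_knn, lam=0.3)

best_before = max(p_lm, key=p_lm.get)   # top token under the LM alone
best_after = max(p_mix, key=p_mix.get)  # top token after interpolation
print(best_before, best_after)
```

In this toy setting the LM alone ranks the correct answer first, while the interpolated distribution ranks a wrong answer first, because retrieval can only add mass to tokens that follow a retrieved context.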
