
Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala

TL;DR

This paper introduces SWiM, a customizable framework for evaluating long-context language models on real-world documents, addressing the limitations of traditional needle-in-a-haystack tests. SWiM comprises task generation, validation, completion, and evaluation, and is demonstrated across eight long-context models to reveal robust yet context-position dependent performance, notably the lost-in-the-middle effect. To counter this, the authors propose medoid voting, a training-free method that runs multiple context permutations and selects the medoid response, achieving substantial accuracy gains (up to 24.2 percentage points) in single-document QA tasks. The work highlights the practical need for production-relevant benchmarks and provides a scalable tool and methodology for assessing and improving long-context reasoning in real-world applications.

Abstract

Large language models are widely used in real-world applications, where they are often tasked with reasoning over large volumes of documents. An exciting development in this space is models with extended context capabilities, some accommodating over 2 million tokens. How well these long-context capabilities hold up in production systems remains uncertain, motivating the need to benchmark performance on real-world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long-context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when the relevant information sits in the middle of the context window (the lost-in-the-middle effect). In addition to our benchmark, we propose medoid voting, a simple but effective training-free approach that helps alleviate this effect by generating responses a few times, each time randomly permuting the documents in the context, and selecting the medoid answer. We evaluate medoid voting on single-document QA tasks, achieving up to a 24% lift in accuracy. Our code is available at https://github.com/snorkel-ai/long-context-eval.
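
The medoid voting procedure described in the abstract (permute the documents, generate an answer per permutation, keep the medoid) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `generate` callable, the `num_runs` and `seed` parameters, and the token-set Jaccard distance are all assumptions made for illustration; in practice an embedding-based distance between responses would likely be a better fit.

```python
import random
from typing import Callable, List, Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance (assumed stand-in for an embedding distance)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def medoid(
    responses: Sequence[str],
    distance: Callable[[str, str], float] = jaccard_distance,
) -> str:
    """Return the response with the smallest total distance to all responses."""
    if not responses:
        raise ValueError("need at least one response")
    totals = [sum(distance(r, other) for other in responses) for r in responses]
    return responses[totals.index(min(totals))]


def medoid_vote(
    question: str,
    documents: Sequence[str],
    generate: Callable[[str, List[str]], str],  # hypothetical model wrapper
    num_runs: int = 5,
    seed: int = 0,
) -> str:
    """Query the model several times, shuffling document order on each run,
    and return the medoid of the collected responses."""
    rng = random.Random(seed)
    responses = []
    for _ in range(num_runs):
        shuffled = list(documents)
        rng.shuffle(shuffled)  # a new random document order for every run
        responses.append(generate(question, shuffled))
    return medoid(responses)
```

Because the answer-bearing document lands at a different depth on each run, a response that is wrong only when the document falls mid-context tends to be an outlier, and the medoid favors the answer that most runs agree on.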

Paper Structure (9 sections, 5 figures, 3 algorithms)

This paper contains 9 sections, 5 figures, 3 algorithms.

Figures (5)

  • Figure 1: Results of NIAH (answering a synthetic “What is the best thing to do in San Francisco?” needle on Paul Graham essays as the haystack), alongside SWiM on a QA task. GPT-4 and Claude 2.1 obtain perfect scores on the NIAH test, at all document depths. But a more realistic QA task on narrative content using SWiM reveals the typical “lost-in-the-middle” effect.
  • Figure 2: SWiM Framework
  • Figure 3: Further analyses of LLMs with the SWiM framework. (Left) Not all models use their long context windows effectively, even though the windows are long enough to hold entire documents. (Right) The "lost-in-the-middle" effect is significant and common across many LLMs.
  • Figure 4: Medoid voting can easily smooth out the "lost-in-the-middle" effect.
  • Figure 5: Voting accuracy as a function of the depth of the document containing the answer