Enhancing LLM Evaluations: The Garbling Trick

William F. Bradley

TL;DR

The paper introduces the garbling trick, a meta-evaluation technique that converts an existing text-based LLM assessment into a family of increasingly difficult tasks by randomly garbling the context at rate $p$. Plotting the score against the garbling rate gives a score curve $s(p)$ that exposes latent reasoning abilities and reduces saturation. The authors demonstrate the approach by constructing NeoSQuAD, a 10,000-question, three-option MCQA dataset derived from SQuAD 2.0 and augmented with incorrect options, then generate score curves for multiple LLMs, including base models and reasoning models that use test-time compute. The results show that at modest garbling ($p\approx 0.3$) model scores separate and reveal reasoning strengths and weaknesses not visible at $p=0$, with larger reasoning models remaining robust in high-garble regimes. The work also discusses contextual-core artifacts, design choices for core selection, and practical extensions such as temperature effects and structured outputs to reduce invalid responses, highlighting the method’s potential to guide long-horizon evaluation research and benchmarking practice.

Abstract

As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative abilities of these models, particularly highlighting the differences between base LLMs and more recent "reasoning" models.
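The idea of turning one evaluation into a curve $s(p)$ can be illustrated with a minimal sketch. The exact garbling procedure is not given on this page, so the code below assumes a simple character-level scheme: each character of the context is independently replaced, with probability $p$, by a random letter. The names `garble`, `score_curve`, and `answer_fn` are hypothetical and stand in for whatever noise model and model-querying code an actual evaluation would use.

```python
import random
import string


def garble(text, p, rng, alphabet=string.ascii_letters):
    """Replace each character with a random one from `alphabet` with probability p.

    Assumption: character-level noise; the paper's exact scheme may differ.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return "".join(
        rng.choice(alphabet) if rng.random() < p else ch for ch in text
    )


def score_curve(items, answer_fn, rates, seed=0):
    """Compute s(p) for each garbling rate in `rates`.

    items: list of (context, question, options, gold_index) tuples.
    answer_fn: hypothetical callable (context, question, options) -> chosen index.
    """
    rng = random.Random(seed)
    curve = {}
    for p in rates:
        correct = 0
        for context, question, options, gold in items:
            noisy_context = garble(context, p, rng)
            if answer_fn(noisy_context, question, options) == gold:
                correct += 1
        curve[p] = correct / len(items)
    return curve
```

Only the context is garbled in this sketch; the question and options are passed through unchanged. At $p=0$ the evaluation reduces to the original one, so the leftmost point of the curve should match the ungarbled score.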

Paper Structure

This paper contains 7 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: NeoSQuAD score curves across eight traditional (non-reasoning) LLMs, normalized by the number of questions answered instead of the number of answers parsed. The shaded region around each curve represents $\pm 1\sigma$ confidence intervals.
  • Figure 2: NeoSQuAD score curves across four reasoning LLMs, normalized by the number of questions answered instead of the number of answers parsed. The shaded region around each curve represents $\pm 1\sigma$ confidence intervals. The performance of the eight non-reasoning models is faintly shown in grey for comparison.
  • Figure 3: NeoSQuAD score curves normalized by the number of questions asked instead of the number of answers parsed.
  • Figure 4: Rate of invalid answers while computing NeoSQuAD score curves.
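The captions above distinguish two normalizations of the score and a rate of invalid answers. The following sketch shows one way these quantities could be computed from raw counts at a single garbling rate. It assumes that "asked" means every question posed to the model and "parsed" means the responses from which a valid option could be extracted, and it uses a binomial standard error for the $\pm 1\sigma$ band; none of these definitions is confirmed by this page, and the function name `summarize` is hypothetical.

```python
import math


def summarize(n_asked, n_parsed, n_correct):
    """Return score and invalid-rate summaries for one garbling rate.

    Assumptions: n_correct <= n_parsed <= n_asked, and the 1-sigma band
    is the binomial standard error sqrt(s * (1 - s) / n).
    """
    if not 0 <= n_correct <= n_parsed <= n_asked or n_asked == 0:
        raise ValueError("need 0 <= n_correct <= n_parsed <= n_asked, n_asked > 0")

    def rate_and_sigma(k, n):
        # Score k/n with its binomial standard error; undefined if n == 0.
        if n == 0:
            return float("nan"), float("nan")
        s = k / n
        return s, math.sqrt(s * (1.0 - s) / n)

    score_asked, sigma_asked = rate_and_sigma(n_correct, n_asked)
    score_parsed, sigma_parsed = rate_and_sigma(n_correct, n_parsed)
    return {
        "score_per_asked": score_asked,
        "sigma_per_asked": sigma_asked,
        "score_per_parsed": score_parsed,
        "sigma_per_parsed": sigma_parsed,
        "invalid_rate": (n_asked - n_parsed) / n_asked,
    }
```

The two scores coincide when every response parses. When many responses are invalid, the per-asked score counts them as wrong while the per-parsed score ignores them, which is why the two normalizations can produce visibly different curves.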