LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacsu, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi

TL;DR

LingOly-TOO introduces a reasoning benchmark that obfuscates linguistic problems to preserve underlying reasoning steps while reducing reliance on internalised knowledge. By combining linguistically informed permutation rules with exact-match evaluation, it reveals that current frontier models rely on shortcuts and their reasoning remains brittle, though Inference-Time Compute models show notable gains. The work provides a framework for disentangling reasoning from memorisation in language tasks, validated by expert reviews and human experiments, and highlights language-resource effects on robustness. Overall, the benchmark advances measurement of genuine reasoning in LLMs and cautions against overestimating reasoning capabilities when knowledge shortcuts are possible. The approach offers a path toward more reliable assessments and guidance for improving robust, generalisable reasoning in frontier models.

Abstract

The expanding knowledge and memorisation capacity of frontier language models allows them to solve many reasoning tasks directly by exploiting prior knowledge, leading to inflated estimates of their reasoning abilities. We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language and designed to counteract the effect of non-reasoning abilities on reasoning estimates. Using linguistically informed rulesets, we permute reasoning problems written in real languages to generate numerous question variations. These permutations preserve the intrinsic reasoning steps required for each solution while reducing the likelihood that problems are directly solvable with models' knowledge. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge. On a metric that rewards consistent reasoning, all models perform poorly and exhibit high variance across question permutations, indicating that the reasoning faculty of Large Language Models (LLMs) remains brittle. Overall, results on the benchmark reflect the recent progress of Inference-Time Compute (ITC) models but suggest ample room for further improvement. The benchmark is a step towards better measurement of reasoning abilities of LLMs and offers a cautionary tale on the importance of disentangling reasoning abilities from models' internalised knowledge when developing reasoning benchmarks.
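
The permutation idea in the abstract can be illustrated with a minimal sketch. It assumes a toy ruleset in which characters may only be swapped within a group (for example, vowels with vowels). The ruleset, the helper names `sample_mapping` and `obfuscate`, and the example words are hypothetical and are not taken from the benchmark's actual rulesets or code.

```python
import random

# Hypothetical ruleset (assumption): characters in the same group may be
# swapped with one another, but never with characters from another group.
RULESET = ["aeiou", "ptk", "mn"]


def sample_mapping(ruleset, rng):
    """Sample a one-to-one character mapping that permutes within each group."""
    mapping = {}
    for group in ruleset:
        shuffled = list(group)
        rng.shuffle(shuffled)
        mapping.update(zip(group, shuffled))
    return mapping


def obfuscate(text, mapping):
    """Apply the mapping character by character; unmapped characters are kept."""
    return "".join(mapping.get(ch, ch) for ch in text)


rng = random.Random(0)
mapping = sample_mapping(RULESET, rng)

# Made-up problem data: only the target-language side is obfuscated,
# the English glosses stay readable.
pairs = [("kata", "word"), ("katami", "my word")]
answer = "katani"

obf_pairs = [(obfuscate(src, mapping), gloss) for src, gloss in pairs]
obf_answer = obfuscate(answer, mapping)
```

Because the mapping is applied character by character, structural relations between forms survive obfuscation: in this toy example the second form still begins with the first, so the same inductive step is available to a solver even though the surface strings have changed.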

Paper Structure

This paper contains 43 sections, 1 equation, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Permutation example. An example problem before (left) and after (right) permutation with the simplified inductive reasoning steps needed for answering. In each permutation, we sample a character mapping based on the ruleset, then apply it to obfuscate the relevant parts of the problem and the answer. Unlike the permuted case, the original problem can be solved with the aid of the model's internalised knowledge.
  • Figure 2: Main results on LingOly-TOO. (a) Scores by model. $M_{og}$ is based on the original problems and $M_{obf}$ is based on permuted problems. $M_{obf}$ (Robust) is calculated after taking the worst score across all permutations of the question. (b) Breakdown of $M_{obf}$ scores by difficulty level.
  • Figure 3: Score distribution across bootstrapped samples. Distribution of scores across 500 bootstrapped samples of our data by model. Each sample consists of 82 problems. Open-source models are shown in orange while proprietary models are in blue (for full results, see Appendix \ref{app:hist_details}).
  • Figure 4: $\Delta_{obf}$ for each model by problem. $\Delta_{obf}$ for the $6$ permutations of $57$ problems (for brevity), showing performance changes by model. Red indicates a performance drop for that particular permutation, while blue indicates an improvement. Results for all problems are in Appendix \ref{app:fullresults}.
  • Figure 5: The effect of permutation is larger for high-resource languages. Each point represents all problems associated with a language with a given number of speakers. Solid lines are fitted regressions and shaded areas are the $95\%$ confidence intervals.
  • ...and 6 more figures
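
The captions for Figures 2 and 3 describe two aggregation steps: a robust score that takes the worst score across all permutations of a question, and a distribution of scores over bootstrapped samples. The sketch below shows one way these could be computed. It assumes that per-question scores are averaged across questions and that bootstrapping resamples problems with replacement; the function names and the toy scores are illustrative and are not the paper's data or code.

```python
import random
from statistics import mean


def mean_score(scores_by_question):
    """Average score over permutations, then over questions (assumed aggregation)."""
    return mean(mean(perms) for perms in scores_by_question.values())


def robust_score(scores_by_question):
    """Worst score across the permutations of each question, averaged over questions."""
    return mean(min(perms) for perms in scores_by_question.values())


def bootstrap_means(per_problem_scores, n_samples, sample_size, rng):
    """Resample problems with replacement and return the mean score of each resample."""
    return [
        mean(rng.choices(per_problem_scores, k=sample_size))
        for _ in range(n_samples)
    ]


# Toy scores (made up): question id -> score on each permutation.
scores = {
    "q1": [1.0, 0.5, 0.0],
    "q2": [1.0, 1.0, 1.0],
    "q3": [0.5, 0.5, 0.0],
}

avg = mean_score(scores)
robust = robust_score(scores)

# 500 resamples of 82 problems each, matching the sizes in the Figure 3 caption.
rng = random.Random(0)
per_problem = [mean(perms) for perms in scores.values()]
boot = bootstrap_means(per_problem, n_samples=500, sample_size=82, rng=rng)
```

Taking the minimum over permutations means a question only scores well if the model answers every variant correctly, which is why the robust score can never exceed the plain average.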