LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacsu, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
TL;DR
LINGOLY-TOO is a reasoning benchmark that obfuscates linguistic problems so that the underlying reasoning steps are preserved while reliance on internalised knowledge is reduced. By combining linguistically informed permutation rules with exact-match evaluation, it reveals that current frontier models rely on knowledge shortcuts and that their reasoning remains brittle, though Inference-Time Compute models show notable gains. The work provides a framework for disentangling reasoning from memorisation in language tasks, validated by expert reviews and human experiments, and highlights how language resource levels affect robustness. Overall, the benchmark advances the measurement of genuine reasoning in LLMs and cautions against overestimating reasoning capabilities when knowledge shortcuts are available. The approach offers a path toward more reliable assessment and guidance for improving robust, generalisable reasoning in frontier models.
Abstract
The expanding knowledge and memorisation capacity of frontier language models allows them to solve many reasoning tasks directly by exploiting prior knowledge, leading to inflated estimates of their reasoning abilities. We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language and designed to counteract the effect of non-reasoning abilities on reasoning estimates. Using linguistically informed rulesets, we permute reasoning problems written in real languages to generate numerous question variations. These permutations preserve the intrinsic reasoning steps required for each solution while reducing the likelihood that problems are directly solvable from models' prior knowledge. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge. On a metric that rewards consistent reasoning, all models perform poorly and exhibit high variance across question permutations, indicating that the reasoning faculty of Large Language Models (LLMs) remains brittle. Overall, results on the benchmark reflect the recent progress of Inference-Time Compute (ITC) models but suggest ample room for further improvement. The benchmark is a step towards better measurement of the reasoning abilities of LLMs and offers a cautionary tale on the importance of disentangling reasoning abilities from models' internalised knowledge when developing reasoning benchmarks.
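To make the idea of permuting a problem's orthography concrete, the following is a minimal illustrative sketch, not the benchmark's actual ruleset. It assumes a toy rule in which vowels are permuted only among vowels and consonants only among consonants over the basic Latin alphabet, and that the same mapping is applied to the target-language strings and to their reference answers. The names `make_permutation`, `obfuscate`, `exact_match`, and `score_problem` are hypothetical, and the strict "all permutations correct" score is only one possible reading of a metric that rewards consistent reasoning.

```python
import random

# Toy grapheme classes (an assumption; real rulesets would be language-specific).
VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def make_permutation(seed):
    """Build a one-to-one grapheme mapping that keeps vowels and consonants apart."""
    rng = random.Random(seed)
    mapping = {}
    for group in (VOWELS, CONSONANTS):
        shuffled = list(group)
        rng.shuffle(shuffled)
        mapping.update(zip(group, shuffled))
    return mapping


def obfuscate(text, mapping):
    """Apply the mapping character by character, preserving case and non-letters."""
    out = []
    for ch in text:
        sub = mapping.get(ch.lower(), ch)
        out.append(sub.upper() if ch.isupper() else sub)
    return "".join(out)


def exact_match(prediction, reference):
    """Exact-match scoring after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()


def score_problem(matches):
    """Summarise one problem across its permutations.

    `matches` holds one boolean per permutation. Returns the mean accuracy
    and a strict flag that is True only if every permutation was answered
    correctly (a hypothetical consistency criterion).
    """
    return sum(matches) / len(matches), all(matches)


# Usage: obfuscate a target-language word and its reference answer with the
# same mapping, so the reasoning needed to reach the answer is unchanged.
mapping = make_permutation(seed=0)
obfuscated_word = obfuscate("Kitabu", mapping)
```

Because the mapping is a bijection within each class, any regular pattern in the original data (for example, a recurring affix) remains a regular pattern after obfuscation, while the surface forms no longer match the real language a model may have memorised.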
