EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Aman Sharma; Paras Chopra

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Aman Sharma, Paras Chopra

TL;DR

EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.

Abstract

Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

TL;DR

Abstract

Paper Structure (63 sections, 7 figures, 10 tables)

This paper contains 63 sections, 7 figures, 10 tables.

Introduction
Related Work
Code Generation Benchmarks
Benchmark Contamination and Gaming
Out-of-Distribution Generalization
Advanced Prompting and Reasoning
The EsoLang-Bench Dataset
Dataset Design
Sample Problems
Target Languages
Data Scarcity as a Feature
Why This Benchmark Matters
Benchmark Design Philosophy
The Benchmark Gaming Cycle
Measuring Reasoning vs. Retrieval
...and 48 more sections

Figures (7)

Figure 1: Average accuracy across all five esoteric languages by model and prompting strategy. Self-Scaffolding consistently achieves the highest accuracy, with GPT-5.2 reaching 6.2%. All models perform below 7% even with advanced scaffolding.
Figure 2: EsoLang-Bench Overview.Left: The benchmark comprises five esoteric programming languages spanning diverse computational paradigms, with 80 problems across four difficulty tiers (400 total evaluations). Right: Evaluation pipeline testing five frontier models across multiple prompting strategies, with automated interpreter-based verification. The best model achieves only 3.8% accuracy compared to 100% on equivalent Python problems.
Figure 3: Training data scarcity (log scale). Esoteric languages have 5,000$\times$ fewer GitHub repositories than Python.
Figure 4: Error distribution by language (GPT-5.2 zero-shot). BF=Brainfuck, Bef=Befunge-98, WS=Whitespace, Unl=Unlambda, Shk=Shakespeare. Whitespace and Unlambda show near-total compile failure; Brainfuck shows primarily logic errors.
Figure 5: Best accuracy achieved per language (across all models and strategies). Befunge-98 is the most tractable (11.2%), while Whitespace remains completely unsolved (0%).
...and 2 more figures

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

TL;DR

Abstract

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Authors

TL;DR

Abstract

Table of Contents

Figures (7)