Generative Evaluation of Complex Reasoning in Large Language Models
Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
TL;DR
The paper addresses the problem of evaluating genuine reasoning in large language models (LLMs) when public benchmarks are vulnerable to training-data contamination. It introduces KUMO, a generative evaluation framework that couples LLMs with a symbolic SAT-based engine to automatically generate diverse, partially observable reasoning tasks across 100 domains, using a knowledge book to separate reasoning from domain knowledge. The pipeline comprises domain proposal, seed configuration generation, SAT-based task construction, knowledge-book creation, and automated evaluation, plus an optimal search algorithm that minimizes the number of actions required. Empirically, 23 LLMs are benchmarked on 5,000 tasks across five environments. The results show that many models outperform university students on easy tasks and that reasoning-scaled models reach university-level performance on harder tasks. Performance on KUMO correlates strongly with results on newly released real-world reasoning benchmarks, and the framework shows resistance to data contamination. Overall, KUMO offers a scalable, contamination-resistant framework for assessing genuine LLM reasoning and generalization in open-ended domains, with publicly available data and code to support broad adoption.
Abstract
With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against those of university students. Our findings reveal that many LLMs outperform university students on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.
