Generative Evaluation of Complex Reasoning in Large Language Models

Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang

TL;DR

The paper tackles the challenge of evaluating genuine reasoning in large language models (LLMs) when public benchmarks are prone to training-data contamination. It introduces KUMO, a generative evaluation framework that couples LLMs with a symbolic SAT-based engine to automatically generate diverse, partially observable reasoning tasks across 100 domains, with a knowledge book that separates reasoning from domain knowledge. The pipeline comprises domain proposal, seed configuration generation, SAT-based task construction, knowledge-book creation, and automated evaluation, plus an optimal search algorithm to minimize the required actions. Empirically, 23 LLMs are benchmarked on 5,000 KUMO-generated tasks and compared against university students: many models exceed university-level performance on easy tasks, and reasoning-scaled models reach university-level performance on harder tasks. Performance on KUMO correlates strongly with newly released real-world reasoning benchmarks, and the benchmark shows resistance to data contamination. Overall, KUMO offers a scalable, contamination-resistant framework for assessing genuine LLM reasoning and generalization in open-ended domains, with publicly available data and code to support broad adoption.

Abstract

With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.

Paper Structure

This paper contains 6 sections, 7 equations, 14 figures, 2 tables, 3 algorithms.

Figures (14)

  • Figure 1: Overview of KUMO tasks. a. An example of the complex reasoning game. In this game, the player is presented with a list of potential "truths", available "actions", and a knowledge guidebook for a specific scenario. In the illustrated case of a diagnostic test scenario, the "truths" represent diseases, and the "actions" correspond to diagnostic tests. During each round, the player selects one action, observes its "outcome", and uses the information to eliminate invalid truths. The objective is to identify the single valid truth using the fewest possible actions. b. The generated tasks in KUMO. This study employs an automated pipeline to generate 100 exemplar task environments across 18 topic categories. Each environment includes approximately 50 truths and 30 actions. The figure shows part of the truths and actions from the Medical environment, which corresponds to the scenario depicted in panel a.
  • Figure 2: The construction of the KUMO benchmark consists of five stages: a. Domain proposal. A capable LLM is prompted to propose various scenarios for the complex game based on its definition. These scenarios, referred to as domains, are collected. b. Seed config generation. The LLM is further prompted to generate foundational elements for each domain, including truths, actions, and their corresponding outcomes. These outcomes are designed to rule out certain truths. c. Task instance generation. To create a specific task instance, the sizes of its candidate truth set and action set are first determined. A subset of truths is then sampled from the universal truth set, with one selected as valid while the others are treated as invalid. The generation of compatible actions and outcomes is modeled as a satisfiability (SAT) problem. A SAT-based engine is employed to sample the action subset and generate outcomes. This process involves extracting related outcomes for each truth, assigning logical values based on validity, and using a SAT solver to produce a viable solution. d. Knowledge book generation. Once a task instance is generated, an LLM is tasked with writing a knowledge book and revising it if any error is detected. This book translates the outcome configurations associated with the sampled truth and action subsets into detailed natural language descriptions. e. Evaluation. In each round, the player takes an action or makes a truth prediction, and a simulator provides the observation for the action based on the outcomes of the task (which are hidden from the player). The goal is to achieve accurate truth prediction while minimizing the number of actions taken.
  • Figure 3: Benchmark results for 100 domains in the Easy setting (#Truths=4, #Actions=6) using KUMO for open-sourced Large Language Models (LLMs). Left panel: Success rates of LLMs, ranked from highest to lowest from left to right. Right panel: Relative action counts of LLMs. Domains are ranked from top to bottom based on the average metrics across LLMs.
  • Figure 4: Benchmarking Large Language Models (LLMs) on KUMO and correlation with other LLM benchmarks. We evaluate 23 state-of-the-art LLMs varying in parameter counts, architectures, and organizational origins across five environments: MedicalEnv, ChemicalEnv, EducationEnv, FantasyEnv, and MusicEnv. Each environment has two difficulty levels: Easy (#Truths=4, #Actions=6) and Hard (#Truths=12, #Actions=16). a. Success rate and relative action count metrics for the Easy setting. b. Success rate and relative action count metrics for the Hard setting. Pearson Correlation of LLM performance between KUMO and c. MMLU-Pro benchmark, d. LongBench-V2 benchmark, and e. LiveBench-Reason benchmark.
  • Figure 5: Performance of Large Language Models (LLMs) fine-tuned on golden trajectories. The MedicalEnv environment is divided into MedicalINDEnv (in-distribution) and MedicalOODEnv (out-of-distribution), each with distinct connection components. Two LLMs, Qwen2.5-0.5B-Instruct and Qwen2.5-3B-Instruct, are fine-tuned on golden trajectories within MedicalINDEnv under Easy (#Truths=4, #Actions=6) and Hard (#Truths=12, #Actions=16) settings. a. Success rate and relative action count metrics for the Easy setting. b. Success rate and relative action count metrics for the Hard setting. Fine-tuned LLMs exhibit strong in-distribution (IND) generalization but suffer severe performance degradation under out-of-distribution (OOD) generalization and difficulty transitions (Easy to Hard / Hard to Easy). This demonstrates the benchmark's resistance to overfitting through diverse setting generation.
  • ...and 9 more figures
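
The game described in Figure 1 and the evaluation loop in Figure 2e can be illustrated with a small sketch. This is not KUMO's implementation: the toy truths, actions, outcome table, and the function names `observe`, `eliminate`, and `play` are all hypothetical, and a fixed-order player stands in for an LLM. The sketch only shows how each observed outcome rules out candidate truths until one remains.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical outcome table: action -> outcome -> truths that outcome rules out.
OUTCOMES: Dict[str, Dict[str, Set[str]]] = {
    "blood_test": {
        "high_wbc": {"migraine"},
        "normal_wbc": {"flu", "pneumonia"},
    },
    "chest_xray": {
        "opacity": {"flu", "migraine", "anemia"},
        "clear": {"pneumonia"},
    },
    "iron_panel": {
        "low_iron": {"flu", "pneumonia", "migraine"},
        "normal_iron": {"anemia"},
    },
}
TRUTHS: List[str] = ["flu", "pneumonia", "migraine", "anemia"]


def observe(action: str, valid_truth: str) -> str:
    """Simulator: return the first outcome that does not rule out the valid truth."""
    for outcome, ruled_out in OUTCOMES[action].items():
        if valid_truth not in ruled_out:
            return outcome
    raise ValueError(f"no outcome of {action!r} is consistent with {valid_truth!r}")


def eliminate(candidates: Set[str], action: str, outcome: str) -> Set[str]:
    """Player-side reasoning: drop every truth the observed outcome rules out."""
    return candidates - OUTCOMES[action][outcome]


def play(valid_truth: str, truths: List[str], actions: List[str]) -> Tuple[Set[str], List[str]]:
    """Take actions in the given order until a single candidate truth remains."""
    candidates = set(truths)
    taken: List[str] = []
    for action in actions:
        if len(candidates) == 1:
            break
        outcome = observe(action, valid_truth)  # the valid truth stays hidden from the player
        candidates = eliminate(candidates, action, outcome)
        taken.append(action)
    return candidates, taken


if __name__ == "__main__":
    remaining, taken = play("pneumonia", TRUTHS, list(OUTCOMES))
    print(remaining, taken)
```

A stronger player would choose the next action adaptively, for example the one expected to eliminate the most remaining candidates, which is what the action-count metric rewards.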