ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, Yue Zhang

TL;DR

ThinkBench presents a robust dynamic OOD benchmark for evaluating LLM reasoning, addressing data contamination and leakage in benchmarks. It dynamically generates 2,912 OOD samples from math and science reasoning tasks (AIME-500, AIME-2024, GPQA Diamond) using scenario-level and attack-level semi-fact data, enabling joint evaluation of reasoning and non-reasoning models. Empirical results across 16 LLMs and 4 Process Reward Models (PRMs) reveal pervasive ID→OOD performance gaps and evidence of data leakage in older datasets, with dynamic OOD data providing a more reliable measurement of reasoning ability and test-time scalability. The findings highlight that larger, more capable models tend to be more robust, and advanced PRMs (e.g., Skywork-PRM, Qwen-PRM) offer superior performance under increased test-time budgets. ThinkBench thus offers a practical, scalable benchmark for robust reasoning in LLMs and points to directions for broadening task coverage and richer scenario generation.

Abstract

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset of 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 Process Reward Models (PRMs) under identical experimental conditions and show that the performance of most LLMs is far from robust and that they exhibit a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Example of ThinkBench datasets containing Scenario-level and Attack-level semi-fact data.
  • Figure 2: Math Reasoning Gap: Most models demonstrate a visible performance gap between their math reasoning performance on ID and OOD, including open-source models and commercial models.
  • Figure 3: Overview of the ThinkBench framework. Based on the original data, ThinkBench dynamically generates scenario-level semi-fact data (a) and attack-level semi-fact data (b), which can be used to evaluate the robustness of reasoning models and non-reasoning models. ThinkBench can also serve as a tool for test-time scaling evaluation (c).
  • Figure 4: The performance gap between the ID and OOD tests on AIME-500 and AIME 2024. "ID performance" and "OOD performance" denote the accuracy of LLMs on the original test sets and the OOD test sets of AIME-500 and AIME 2024, respectively.
  • Figure 5: Test-time Scaling Law. The model's performance on the OOD dataset increases as the test-time computation budget increases, using Qwen2.5-Math-7B-IT as the policy model along with several PRMs.
  • ...and 2 more figures
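
The ID→OOD performance gap described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes exact-match scoring of final answers and defines the gap as ID accuracy minus OOD accuracy; the function names `accuracy` and `id_ood_gap` are hypothetical, and the paper's own scoring procedure may differ.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers.

    Assumption: exact string match after stripping whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


def id_ood_gap(
    id_predictions: Sequence[str],
    id_answers: Sequence[str],
    ood_predictions: Sequence[str],
    ood_answers: Sequence[str],
) -> Dict[str, float]:
    """Return ID accuracy, OOD accuracy, and their difference (ID minus OOD)."""
    id_acc = accuracy(id_predictions, id_answers)
    ood_acc = accuracy(ood_predictions, ood_answers)
    return {"id_acc": id_acc, "ood_acc": ood_acc, "gap": id_acc - ood_acc}


# Toy data for illustration only; these are not results from the paper.
result = id_ood_gap(
    id_predictions=["1", "2", "3", "4"],
    id_answers=["1", "2", "3", "5"],
    ood_predictions=["1", "0", "0", "4"],
    ood_answers=["1", "2", "3", "4"],
)
print(result)
```

Under this definition, a positive gap means the model scores higher on the original test than on its OOD counterpart, and a gap near zero suggests performance that is stable across the two.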