ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, Yue Zhang
TL;DR
ThinkBench is a dynamic out-of-distribution (OOD) benchmark for robustly evaluating LLM reasoning, designed to counter data contamination and answer leakage in existing benchmarks. It dynamically generates 2,912 OOD samples from math and science reasoning tasks (AIME-500, AIME-2024, GPQA Diamond) using scenario-level and attack-level semi-fact data, enabling joint evaluation of reasoning and non-reasoning models. Empirical results across 16 LLMs and 4 Process Reward Models (PRMs) reveal pervasive performance gaps between in-distribution (ID) and OOD data, along with evidence of data leakage in older datasets; the dynamic OOD data provides a more reliable measurement of reasoning ability and test-time scalability. The findings indicate that larger, more capable models tend to be more robust, and that advanced PRMs (e.g., Skywork-PRM, Qwen-PRM) perform better under increased test-time budgets. ThinkBench thus offers a practical, scalable benchmark for robust reasoning in LLMs and points to directions for broadening task coverage and richer scenario generation.
Abstract
Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel framework for robustly evaluating LLMs' reasoning capability. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset of 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning and non-reasoning models. We evaluate 16 LLMs and 4 process reward models (PRMs) under identical experimental conditions and show that the performance of most LLMs is far from robust and that the models exhibit a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.
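As an illustration of what a robustness comparison between original and OOD data can look like, the following minimal sketch computes a model's accuracy on each set and the resulting drop. The function names and the absolute/relative gap definitions are assumptions made for this example; they are not taken from ThinkBench and may differ from the metrics the paper reports.

```python
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def id_ood_gap(id_acc: float, ood_acc: float) -> Tuple[float, float]:
    """Return (absolute drop, relative drop) from ID to OOD accuracy.

    Hypothetical definition for illustration only: the relative drop is
    the absolute drop divided by the ID accuracy.
    """
    absolute = id_acc - ood_acc
    relative = absolute / id_acc if id_acc > 0 else 0.0
    return absolute, relative


if __name__ == "__main__":
    # Toy answers, not real benchmark data.
    id_acc = accuracy(["A", "B", "C", "D"], ["A", "B", "C", "D"])
    ood_acc = accuracy(["A", "B", "C", "D"], ["A", "B", "A", "A"])
    absolute, relative = id_ood_gap(id_acc, ood_acc)
    print(f"ID={id_acc:.2f} OOD={ood_acc:.2f} "
          f"abs_drop={absolute:.2f} rel_drop={relative:.2f}")
```

Under this assumed definition, a large drop on the OOD variant of a question set is consistent with either brittle reasoning or memorization of the original items, which is the kind of signal the abstract describes.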
