WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Zipeng Ling; Yuehao Tang; Shuliang Liu; Junqi Yang; Shenghong Fu; Chen Huang; Kejia Huang; Yao Wan; Zhichao Hou; Xuming Hu

TL;DR

WakenLLM investigates the Vague Perception (VP) phenomenon, in which LLMs output Unknown on reasoning tasks that are in fact verifiable or solvable. It introduces a fine-grained benchmark and a two-stage Remind-then-Guide stimulation framework to quantify and awaken latent reasoning, reporting accuracy improvements of up to 68.53% on VP samples without retraining. Across six LLMs and four benchmarks, the approach reveals substantial latent reasoning capacity that current baselines fail to activate, suggesting a higher theoretical upper bound on reasoning accuracy. The framework provides a principled way to assess and harness an LLM's potential, with implications for trustworthiness, evaluation protocols, and robust reasoning in AI systems.

Abstract

Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks. Two scenarios may underlie this output: (i) the input sample is genuinely unverifiable, but the model does not understand why; and (ii) the problem is verifiable, but the model fails to solve it and therefore outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest rather than on analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown outputs attributable to model incapacity and evaluates whether stimulation can convert these outputs into either correct answers (for verifiable samples) or justified responses with valid reasoning (for unverifiable samples). Our method offers a clearer picture of the limits of LLM reasoning and of the potential for correction across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter updates, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods activate only a small portion of LLMs' reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bound of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.
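As a rough illustration of the two quantities the abstract describes (the share of Unknown outputs attributable to model incapacity, and the fraction of those that stimulation converts into correct answers), the following Python sketch computes both from labelled predictions. It is a simplified reading, not the paper's protocol: it covers only the label-checkable scenario (ii), and the label set ("True", "False", "Unknown"), the `Record` fields, the `vp_stats` helper, and the toy records are all hypothetical.

```python
from dataclasses import dataclass

UNKNOWN = "Unknown"


@dataclass
class Record:
    gold: str        # reference label (assumed set: "True", "False", "Unknown")
    baseline: str    # model label before any stimulation
    stimulated: str  # model label after stimulation


def vp_stats(records):
    """Return (vp_share, conversion_rate) for scenario (ii) only.

    vp_share: among baseline Unknown outputs, the fraction whose gold
        label is actually decidable (an assumed proxy for incapacity).
    conversion_rate: among those samples, the fraction answered
        correctly after stimulation.
    """
    unknown_outputs = [r for r in records if r.baseline == UNKNOWN]
    vp = [r for r in unknown_outputs if r.gold != UNKNOWN]
    converted = [r for r in vp if r.stimulated == r.gold]
    vp_share = len(vp) / len(unknown_outputs) if unknown_outputs else 0.0
    conversion_rate = len(converted) / len(vp) if vp else 0.0
    return vp_share, conversion_rate


# Toy records for illustration only; these are not results from the paper.
records = [
    Record(gold="True", baseline="Unknown", stimulated="True"),
    Record(gold="False", baseline="Unknown", stimulated="Unknown"),
    Record(gold="Unknown", baseline="Unknown", stimulated="Unknown"),
    Record(gold="True", baseline="True", stimulated="True"),
]
vp_share, conversion_rate = vp_stats(records)
print(f"VP share of Unknown outputs: {vp_share:.2f}")
print(f"Conversion rate on VP samples: {conversion_rate:.2f}")
```

Scenario (i), where the sample is genuinely unverifiable, cannot be scored from labels alone because it depends on whether the model's justification is valid; a fuller version would need a separate judgment of the reasoning.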
