S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu

TL;DR

S1-Bench is a bilingual, multi-domain benchmark that evaluates system 1 thinking in large reasoning models (LRMs) using simple, intuitive questions. The authors describe a construction workflow that combines a priori constraints with a posteriori verification, and they evaluate 28 LRMs in two languages on format, efficiency, and accuracy metrics. The results show that LRMs are generally inefficient, sometimes insufficiently accurate, and not fully robust on simple tasks. The authors also report a phenomenon they call the 'gut moment': an early signal of how difficult the model perceives a question to be. The work lays groundwork for dual-system compatibility by showing where LRMs fail to align difficulty perception with generation behavior, and by proposing a scalable workflow for expanding the benchmark and improving efficiency.

Abstract

We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. However, no appropriate benchmark exists for evaluating LRMs' system 1 thinking capabilities. To fill this gap, S1-Bench introduces a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs' performance on questions more suitable for system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. Additionally, we observe a gap between their difficulty perception and generation length. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.

Paper Structure

This paper contains 48 sections, 2 equations, 10 figures, and 25 tables.

Figures (10)

  • Figure 1: Construction workflow for S1-Bench and an illustrative example from each major category.
  • Figure 2: Statistical distribution of token counts for S1-Bench questions.
  • Figure 3: (a) Comparison of first round and additional token costs for each LRM. (b) Distribution of solution rounds for each LRM.
  • Figure 4: Distribution of the thinking process across four categories. FA and TP refer to Final Answer and Thinking Process, respectively. Green bars indicate cases where the final answer is correct, while red bars indicate cases where it is incorrect.
  • Figure 5: Top: Count of "gut moments" across models. Bottom: Probability of "gut moments" by question type.
  • ...and 5 more figures