S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Wenyuan Zhang; Shuaiyi Nie; Xinghua Zhang; Zefeng Zhang; Tingwen Liu

TL;DR

S1-Bench is a bilingual, multi-domain benchmark that evaluates system 1 thinking in large reasoning models (LRMs) using simple, intuitive questions. The authors describe a construction workflow that combines a priori constraints with a posteriori verification, and they evaluate 28 LRMs in two languages on format, efficiency, and accuracy metrics. The results show that LRMs are generally inefficient, sometimes insufficiently accurate, and not fully robust on simple tasks. The authors also report a phenomenon they label the 'gut moment': an early signal of perceived difficulty. The work lays groundwork for dual-system compatibility by showing where LRMs fail to align difficulty perception with generation behavior, and by proposing a scalable workflow for expanding the benchmark and improving efficiency.

Abstract

We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. However, no appropriate benchmark exists for evaluating LRMs' system 1 thinking capabilities. To fill this gap, S1-Bench introduces a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs' performance on questions more suitable for system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. Additionally, we observe a gap between their difficulty perception and generation length. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.
