Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking

Alireza S. Ziabari; Nona Ghazizadeh; Zhivar Sourati; Farzan Karimi-Malekabadi; Payam Piray; Morteza Dehghani

TL;DR

The study asks whether large language models should be restricted to a single reasoning style or guided to use both fast, heuristic (System 1) and slow, analytical (System 2) thinking. The authors align LLMs to each style using a curated dataset of 2,000 dual-style questions generated with expert input, then build a training-free, entropy-based arbitration framework that selects between System 1 and System 2 outputs based on uncertainty signals. Results show a distinct accuracy–efficiency trade-off: System 2 excels in arithmetic and symbolic tasks, while System 1 excels in commonsense tasks. Interpolating between the two styles by varying the proportion of alignment data changes accuracy monotonically, and a dynamic, entropy-guided combination improves results on nearly all benchmarks. The findings argue for task-dependent, adaptive reasoning in LLMs and provide a practical, training-free method to realize such adaptability in deployment.

Abstract

Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step processing reveals a critical limitation. In contrast, human cognition fluidly adapts between intuitive, heuristic (System 1) and analytical, deliberative (System 2) reasoning depending on the context. This difference between human cognitive flexibility and LLMs' reliance on a single reasoning style raises a critical question: while human fast heuristic reasoning evolved for its efficiency and adaptability, is a uniform reasoning approach truly optimal for LLMs, or does its inflexibility make them brittle and unreliable when faced with tasks demanding more agile, intuitive responses? To answer this question, we explicitly align LLMs to these reasoning styles by curating a dataset with valid System 1 and System 2 answers, and evaluate their performance across reasoning benchmarks. Our results reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense reasoning tasks. To analyze the reasoning spectrum, we interpolated between the two extremes by varying the proportion of alignment data, which resulted in a monotonic change in accuracy. A mechanistic analysis of model responses shows that System 1 models employ more definitive outputs, whereas System 2 models demonstrate greater uncertainty. Building on these findings, we further combine System 1- and System 2-aligned models based on the entropy of their generations, without additional training, and obtain a dynamic model that outperforms across nearly all benchmarks. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.
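
As a rough illustration of what entropy-based combination could look like, the sketch below picks between a System 1 and a System 2 candidate by comparing the mean per-token entropy of their generations. The abstract does not state the actual decision rule, so the rule used here (prefer the lower-entropy generation, ties to System 1) and the names `token_entropy`, `mean_entropy`, and `select_by_entropy` are assumptions for illustration only, not the authors' method.

```python
import math
from typing import List, Sequence, Tuple

# A candidate is (answer text, per-token next-token probability distributions).
Candidate = Tuple[str, List[List[float]]]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over one generated sequence."""
    if not distributions:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def select_by_entropy(s1: Candidate, s2: Candidate) -> str:
    """Hypothetical rule (an assumption, not taken from the paper):
    return the answer whose generation has lower mean entropy,
    preferring the System 1 answer on ties."""
    s1_answer, s1_dists = s1
    s2_answer, s2_dists = s2
    if mean_entropy(s1_dists) <= mean_entropy(s2_dists):
        return s1_answer
    return s2_answer


# Toy distributions over a two-token vocabulary, invented for this example.
confident = ("A", [[0.9, 0.1], [0.8, 0.2]])
uncertain = ("B", [[0.5, 0.5], [0.6, 0.4]])
chosen = select_by_entropy(confident, uncertain)
```

Because the abstract reports that System 2 models show greater uncertainty in general, a raw comparison like this one would tend to favor System 1; a practical rule would likely need a threshold or per-model normalization, which is left out here.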
