Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking

Alireza S. Ziabari, Nona Ghazizadeh, Zhivar Sourati, Farzan Karimi-Malekabadi, Payam Piray, Morteza Dehghani

TL;DR

The study investigates whether large language models should be restricted to a single reasoning style or guided to use both fast, heuristic (System 1, $\mathcal{S}1$) and slow, analytical (System 2, $\mathcal{S}2$) thinking. Models are aligned to each style using a curated dataset of 2,000 dual-style questions generated with expert input, and a training-free, entropy-based arbitration framework then selects between $\mathcal{S}1$ and $\mathcal{S}2$ outputs based on uncertainty signals. Results show a distinct accuracy–efficiency trade-off: $\mathcal{S}2$ excels in arithmetic and symbolic tasks, while $\mathcal{S}1$ excels in commonsense tasks; interpolating between styles yields monotonic changes in accuracy, and a dynamic, entropy-guided combination improves results across nearly all benchmarks. The findings argue for task-dependent, adaptive reasoning in LLMs and provide a practical, training-free method to realize such adaptability in deployment.

Abstract

Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step processing reveals a critical limitation. In contrast, human cognition fluidly adapts between intuitive, heuristic (System 1) and analytical, deliberative (System 2) reasoning depending on the context. This difference between human cognitive flexibility and LLMs' reliance on a single reasoning style raises a critical question: while human fast heuristic reasoning evolved for its efficiency and adaptability, is a uniform reasoning approach truly optimal for LLMs, or does its inflexibility make them brittle and unreliable when faced with tasks demanding more agile, intuitive responses? To answer this question, we explicitly align LLMs to these reasoning styles by curating a dataset with valid System 1 and System 2 answers, and evaluate their performance across reasoning benchmarks. Our results reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense reasoning tasks. To analyze the reasoning spectrum, we interpolated between the two extremes by varying the proportion of alignment data, which resulted in a monotonic change in accuracy. A mechanistic analysis of model responses shows that System 1 models employ more definitive outputs, whereas System 2 models demonstrate greater uncertainty. Building on these findings, we further combine System 1- and System 2-aligned models based on the entropy of their generations, without additional training, and obtain a dynamic model that outperforms across nearly all benchmarks. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.
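
The interpolation step in the abstract ("varying the proportion of alignment data") can be pictured with a minimal sketch. It assumes that alignment uses preference triples (prompt, chosen, rejected) and that a mixing ratio decides which answer style is preferred for each question; the abstract does not spell out this construction, so the names `DualAnswerItem`, `build_preference_pairs`, and `s2_ratio` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DualAnswerItem:
    """One question with a valid answer in each style (hypothetical schema)."""
    question: str
    s1_answer: str  # fast, heuristic answer
    s2_answer: str  # slow, step-by-step answer


def build_preference_pairs(items, s2_ratio, seed=0):
    """Return (prompt, chosen, rejected) triples.

    Assumption: a fraction `s2_ratio` of the questions prefers the System 2
    answer and the rest prefer the System 1 answer, so 0.0 gives a fully
    System 1 set and 1.0 a fully System 2 set.
    """
    if not 0.0 <= s2_ratio <= 1.0:
        raise ValueError("s2_ratio must be in [0, 1]")
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)  # choose which questions prefer System 2
    prefer_s2 = set(order[:round(s2_ratio * len(items))])
    pairs = []
    for i, item in enumerate(items):
        if i in prefer_s2:
            pairs.append((item.question, item.s2_answer, item.s1_answer))
        else:
            pairs.append((item.question, item.s1_answer, item.s2_answer))
    return pairs
```

Sweeping `s2_ratio` from 0.0 to 1.0 and aligning one model per setting would give a series of models along the System 1 to System 2 spectrum, which is one plausible reading of the interpolation experiment rather than a confirmed description of it.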

Paper Structure

This paper contains 44 sections, 3 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: (A) Sample of dataset with System 1 and System 2 answers. (B) Overview of our alignment approach with fast and slow thinking. (C) Overview of our dynamic entropy based selection method.
  • Figure 2: Token difference between System 1 and System 2 responses relative to Llama model across prompting stages and alignment methods.
  • Figure 3: (A) Log probabilities of models' reasoning indicating internal uncertainty; (B) Hedge word ratio showing surface-level uncertainty; (C) Proportion of definitive answers in the first n sentences.
  • Figure 4: Accuracy across benchmark categories as reasoning shifts from System 1 to System 2.
  • Figure 5: (A) Performance of Llama models (DPO- and SimPO-dynamic models) on the GSM8K dataset as $w$ varies in Eq. \ref{eq:combined_score}. The dashed line represents the accuracy of the base Llama model. (B) Violin plots of average entropy ($\bar{H}$) and its variance ($\sigma^2$) distribution for DPO-aligned Llama models on GSM8K, broken down by four possible outcomes.
  • ...and 8 more figures
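
The entropy-based selection named in Figure 1(C) and Figure 5 can be illustrated with a small sketch. The combined-score equation itself is not reproduced on this page, so the weighted form below, $w \cdot \bar{H} + (1 - w) \cdot \sigma^2$, is an assumption, as is the rule that the candidate with the lower score is returned. The function names and the `(text, token_dists)` input format are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_score(token_dists, w):
    """Assumed combined score: w * mean entropy + (1 - w) * entropy variance."""
    if not token_dists:
        raise ValueError("token_dists must contain at least one distribution")
    entropies = [token_entropy(d) for d in token_dists]
    mean_h = sum(entropies) / len(entropies)
    var_h = sum((h - mean_h) ** 2 for h in entropies) / len(entropies)
    return w * mean_h + (1.0 - w) * var_h


def select_response(s1, s2, w=0.5):
    """Return the text of the candidate with the lower assumed score.

    Each candidate is a (text, token_dists) pair, where token_dists holds one
    probability distribution per generated token. Ties go to the first (S1).
    """
    score1 = uncertainty_score(s1[1], w)
    score2 = uncertainty_score(s2[1], w)
    return s1[0] if score1 <= score2 else s2[0]
```

With $w = 1$ the choice depends only on average entropy, and with $w = 0$ only on its variance, which is why sweeping $w$ as in Figure 5(A) can change which model's answer is returned. No extra training is involved; only the generation-time distributions of the two aligned models are needed.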