Evolutionary System 2 Reasoning: An Empirical Proof

Zeyuan Ma, Wenqi Huang, Guo-Huan Song, Hongshu Guo, Sijie Ma, Zhiguang Cao, Yue-Jiao Gong

TL;DR

The paper addresses the limited general System 2 reasoning of current LLMs and introduces Evolutionary Reasoning Optimization (ERO), an island-based (μ+λ) evolutionary strategy that evolves LLM parameters toward improved reasoning on ARC tasks. It demonstrates that a relatively small model (Qwen-7B) can be evolved to achieve reasoning performance competitive with GPT-5 on several ARC tasks, suggesting that evolution can unlock general reasoning without relying on scale alone. Key contributions include a scalable island-based ES, a layer-wise covariance sampling approach, a flexible ARC-oriented scoring function, and practical mechanisms such as Ray acceleration and cache optimization that enable large-population evolution. The work highlights the potential of evolutionary search to cultivate reasoning abilities in LLMs and outlines future directions for meta-evolution across task distributions.

Abstract

Machine intelligence embodies the long-standing goal of making machines as intelligent as human beings. While recent Large Language Models (LLMs) show substantial specific skills across a wide array of downstream tasks, they still fall short in general intelligence. Motivated by the correlation between intelligence and System 2 reasoning (slow thinking), this paper aims to answer a worthwhile research question: could machine intelligence such as LLMs be evolved to acquire reasoning ability (rather than a specific skill), just as human beings did? To this end, we propose the Evolutionary Reasoning Optimization (ERO) framework, which performs survival of the fittest over a population of LLMs to search for an individual with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we report two surprising empirical findings: i) the latest LLMs, such as GPT-5, still show limited System 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project is available at https://github.com/MetaEvo/ERO for reproduction.
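
The loop described in the abstract (initialize a population, then evolve it to maximize the score of the best individual) can be illustrated with a minimal (μ+λ) evolution strategy. This is a toy sketch under stated assumptions, not the authors' implementation: individuals are short lists of floats standing in for LLM parameters, `toy_score` is a hypothetical stand-in for the quantified reasoning score, and the names `evolve`, `mu`, `lam`, and `sigma` are illustrative. It uses a single population with isotropic Gaussian mutation, so the island structure and layer-wise covariance sampling mentioned in the TL;DR are not modeled.

```python
import random


def evolve(score, dim, mu=4, lam=8, sigma=0.3, generations=50, seed=0):
    """Toy (mu+lambda) evolution strategy over real-valued vectors.

    Assumption: each individual is a plain list of floats. In ERO an
    individual would be an LLM's parameters, which this sketch does not model.
    """
    rng = random.Random(seed)
    # Initialize mu random parents.
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(mu)]
    history = []  # best score after each generation
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            # Mutate a randomly chosen parent with isotropic Gaussian noise.
            parent = rng.choice(population)
            offspring.append([w + rng.gauss(0.0, sigma) for w in parent])
        # "Plus" selection: parents compete with offspring, so the best
        # score in the population can never decrease.
        population = sorted(population + offspring, key=score, reverse=True)[:mu]
        history.append(score(population[0]))
    return population[0], history


def toy_score(weights):
    """Hypothetical fitness: negative squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best, history = evolve(toy_score, dim=3)
print("best score per generation (first, last):", history[0], history[-1])
```

Because selection keeps the top `mu` individuals from the union of parents and offspring, the best score per generation is non-decreasing, which is the property that makes the "best individual" objective well behaved in this kind of loop.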

Paper Structure

This paper contains 11 sections, 3 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: An intuitive comparison between the evolution paths of human beings and machine intelligence.
  • Figure 2: A reasoning task example in ARC benchmark.
  • Figure 3: System prompt and user prompt used across all baselines.
  • Figure 4: Evolution curve of ERO on ARC benchmark.
  • Figure 5: Showcases of the effectiveness of ERO in boosting the understanding and reasoning ability of LLMs.