Intelligence at the Edge of Chaos

Shiyang Zhang; Aakash Patel; Syed A Rizvi; Nianchen Liu; Sizhuang He; Amin Karbasi; Emanuele Zappala; David van Dijk

TL;DR

This work investigates whether intelligence in large language models can emerge from exposure to complex, rule-based data rather than human-generated data. The authors pretrain GPT-2 variants on sequences generated by elementary cellular automata (ECAs) spanning Wolfram's four complexity classes, then evaluate them on downstream tasks (ARC-inspired reasoning, Nim, and chess move prediction). They find a positive relationship between data complexity and downstream performance, peaking near the edge of chaos (Class IV), where data are structured yet hard to predict. Attention analyses show that models trained on more complex data draw on longer temporal histories, suggesting that they develop nontrivial, transferable representations rather than trivial rule-following. The study outlines implications for data-centric AI development, offering a framework for harnessing complexity to elicit emergent capabilities and providing reproducible pipelines for future exploration.

Abstract

We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules. Our study focuses on elementary cellular automata (ECA), simple yet powerful one-dimensional systems that generate behaviors ranging from trivial to highly complex. By training distinct Large Language Models (LLMs) on different ECAs, we evaluate the relationship between the complexity of the rules' behavior and the intelligence exhibited by the LLMs, as reflected in their performance on downstream tasks. Our findings reveal that rules with higher complexity lead to models exhibiting greater intelligence, as demonstrated by their performance on reasoning and chess move prediction tasks. Both uniform and periodic systems, and often also highly chaotic systems, result in poorer downstream performance, highlighting a sweet spot of complexity conducive to intelligence. We conjecture that intelligence arises from the ability to predict complexity and that creating intelligence may require only exposure to complexity.
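To make the notion of an elementary cellular automaton concrete, here is a minimal sketch of how a rule number determines the evolution of a one-dimensional binary state. The function names (`eca_step`, `eca_run`), the periodic boundary condition, the grid width, and the single-seed initial state are illustrative assumptions and are not taken from the paper's data pipeline.

```python
def eca_step(state, rule):
    """Apply one synchronous update of an elementary cellular automaton.

    Assumption: periodic (wrap-around) boundaries; the paper's boundary
    handling is not stated in this summary.
    """
    n = len(state)
    # Wolfram numbering: bit k of the rule number is the output for the
    # neighborhood whose value is k = 4*left + 2*center + right.
    table = [(rule >> k) & 1 for k in range(8)]
    return [
        table[4 * state[(i - 1) % n] + 2 * state[i] + state[(i + 1) % n]]
        for i in range(n)
    ]


def eca_run(rule, initial, steps):
    """Return the list of states: the initial state plus `steps` updates."""
    states = [list(initial)]
    for _ in range(steps):
        states.append(eca_step(states[-1], rule))
    return states


if __name__ == "__main__":
    width = 31  # hypothetical width chosen for display only
    start = [0] * width
    start[width // 2] = 1  # single live cell in the middle
    for row in eca_run(110, start, 15):
        print("".join("#" if cell else "." for cell in row))
```

Flattening successive rows of such a run into a token sequence is one plausible way to obtain next-token-prediction data. Swapping the rule number (for example 0, 90, or 110) changes the qualitative behavior of the output.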

Paper Structure (32 sections, 8 figures, 1 table)

Introduction
Background
Elementary Cellular Automata
Large Language Models
Complexity Measures
Methodology
Data Generation
Training Procedure for GPT-2 Models
Pretraining Setup
Experiments
Downstream Task: Reasoning
Downstream Task: Chess Move Prediction
Hardware and Software
Results
Relationship between Intelligence and Complexity
...and 17 more sections

Figures (8)

Figure 1: Our framework for investigating the link between complexity and intelligence. We pretrain Large Language Models (LLMs) on Elementary Cellular Automata (ECAs) from different complexity classes using next-token prediction, then evaluate them on downstream reasoning and chess move prediction tasks. We use various measures to analyze the complexity of ECA-generated data, and quantify the relationship between complexity and downstream performance.
Figure 2: Relationship between downstream task performance and data complexity. (a) Eight representative ECA rules, two from each of Wolfram's four complexity classes. Performance of models trained on these rules is highlighted in the top row of (b). (b) Top row: Model performance in relation to the Lempel-Ziv complexity of data generated by each rule. The left and center panels show efficiency (1 divided by number of epochs to reach 80% validation accuracy) for the easy and hard reasoning tasks, respectively. The right panel shows move prediction accuracy for the chess task. The rules depicted on the left are highlighted in the plot with triangles and annotated with the rule number. The correlation coefficient is shown in the top-left corner of each plot. An asterisk next to the value indicates a significant relationship ($p < 0.05$). Bottom row: Downstream task performance based on Wolfram classification of each rule. Models trained on Class III and Class IV (chaotic and complex) rules perform better than models trained on uniform and simple rules. Baseline results for a randomly initialized transformer model are shown with a dashed black line on all plots.
Figure 3: Attention scores for the final 10 states prior to the target state, showing that models trained on more complex data rely more heavily on past states for prediction. Left: Visualization of the last 10 states and the target state for representative rules from each of Wolfram's complexity classes. Center: Attention scores for each of the last 10 states, highlighting that models trained on chaotic and complex (Class III and Class IV) rules focus more on recent states, while models trained on uniform rules exhibit consistently low attention. Periodic rules demonstrate a repeating attention pattern, suggesting that the model is learning to attend to earlier cycles of the same state rather than general state history. Right: Average attention across the final 10 states for all rules, plotted against Lempel-Ziv complexity. A strong positive correlation ($r=0.66$) indicates that models trained on higher-complexity data attend more strongly to historical states during prediction.
Figure 4: Comparison of model performance on short-term (1-step) and long-term (5-step) prediction tasks for ECA rules. Points are colored by Lempel-Ziv complexity, with the dashed line indicating equal performance. Points below the line show better short-term performance.
Figure 5: Scaling experiments with varying quantities of data and different model sizes. Left: Number of tokens seen before convergence during pretraining for the Tiny model. Center: Number of tokens seen before convergence during pretraining for the Small model. Right: Validation loss as a function of token consumption for models trained on data from ECA Rule 110. Larger models achieve lower validation loss with fewer tokens, highlighting the improved data efficiency of larger models.
...and 3 more figures
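The Figure 2 caption relies on two quantities: the Lempel-Ziv complexity of the generated data and an efficiency score defined as 1 divided by the number of epochs needed to reach 80% validation accuracy. The sketch below illustrates both under stated assumptions. The phrase-counting parse shown is one common LZ78-style variant and may differ from the estimator used in the paper; the helper names and the choice to score a run that never reaches the threshold as 0.0 are hypothetical.

```python
def lz_phrase_count(sequence):
    """Count phrases in an LZ78-style incremental parse of a string.

    Assumption: this variant stands in for "Lempel-Ziv complexity";
    the paper's exact estimator is not given in this summary.
    """
    phrases = set()
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in phrases:
            # First time this phrase is seen: record it, start a new one.
            phrases.add(current)
            current = ""
    # A trailing phrase that was already seen still counts once.
    return len(phrases) + (1 if current else 0)


def epochs_to_threshold(val_accuracies, threshold=0.8):
    """Return the 1-indexed epoch at which accuracy first reaches threshold."""
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy >= threshold:
            return epoch
    return None  # threshold never reached


def efficiency(epochs):
    """Efficiency as described in the caption: 1 / epochs to reach 80%.

    Assumption: a run that never reaches the threshold scores 0.0.
    """
    return 0.0 if epochs is None else 1.0 / epochs


if __name__ == "__main__":
    print(lz_phrase_count("0" * 10))        # uniform sequence
    print(lz_phrase_count("0101010101"))    # periodic sequence
    print(efficiency(epochs_to_threshold([0.5, 0.7, 0.85, 0.9])))
```

A uniform sequence parses into fewer phrases than an irregular sequence of the same length, which is the sense in which the score tracks data complexity.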
