PLDR-LLMs Reason At Self-Organized Criticality

Burc Gokden

Abstract

We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality are similar to those of second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady-state behaviour suggests that deductive outputs learn representations equivalent to scaling functions, universality classes and renormalization groups from the training dataset, leading to generalization and reasoning capabilities in the process. We can then define an order parameter from the global statistics of the model's deductive output parameters at inference. The reasoning capabilities of a PLDR-LLM are better when its order parameter is close to zero at criticality. This observation is supported by the benchmark scores of the models trained at near-criticality and sub-criticality. Our results provide a self-contained explanation of how reasoning manifests in large language models, and show that the ability to reason can be quantified solely from global model parameter values of the deductive outputs at steady state, without any need to evaluate curated benchmark datasets through inductive output for reasoning and comprehension.

Paper Structure (15 sections, 1 equation, 14 figures, 9 tables)

Figures (14)

  • Figure 1: Train loss (a) and accuracy (b) curves for the PLDR-LLMs pretrained at near-critical and sub-critical conditions. Each data point was captured as a running average over 2000 steps. To get the actual step count at which a measurement was taken, multiply the Step Index by 10000 for PLDRv51-SOC-110M-5 and by 2000 for the other models.
  • Figure 2: Deductive output probability density distributions for all values in a model for PLDRv51-SOC-110M-4 and SUB-SOC-110M-2, binned in 100 buckets. The ${\bm{\mathsfit{A}}}_{\textbf{P}}$ and ${\bm{\mathsfit{G}}}_{LM}$ distributions were plotted up to $\pm5\sigma$ for easier visibility of the main distribution characteristics. The ${\bm{\mathsfit{A}}}$ and ${\bm{\mathsfit{A}}}_{LM}$ distributions were plotted as log-linear.
  • Figure 3: (Cont.) Deductive output probability density distributions for all values in a model for PLDRv51-SOC-110M-4 and SUB-SOC-110M-2, binned in 100 buckets. The ${\bm{\mathsfit{A}}}_{\textbf{P}}$ and ${\bm{\mathsfit{G}}}_{LM}$ distributions were plotted up to $\pm5\sigma$ for easier visibility of the main distribution characteristics. The ${\bm{\mathsfit{A}}}$ and ${\bm{\mathsfit{A}}}_{LM}$ distributions were plotted as log-linear. The heatmaps of ${\bm{\mathsfit{A}}}$ were averaged over all samples for the same layer and head.
  • Figure 4: Train loss (a) and accuracy (b) curves for the PLDR-LLMs pretrained as an ablation study at near-critical and sub-critical conditions and exhibiting dragon king events. Each data point was captured as a running average over 2000 steps.
  • Figure 5: ${\bm{\mathsfit{A}}}$ probability density distributions for all models binned in 100 buckets for main plots and insets.
  • ...and 9 more figures
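
The binned density plots described in the captions for Figures 2, 3 and 5 (all values pooled, 100 buckets, optionally restricted to $\pm5\sigma$) could be reproduced along the following lines. This is only a minimal sketch and not the author's plotting code: the function name `binned_density` and its parameters are hypothetical, and it assumes the $\pm5\sigma$ window is centred on the mean and that the density is normalized over the values inside that window.

```python
import numpy as np

def binned_density(values, n_bins=100, n_sigma=None):
    """Estimate a probability density by histogramming all values.

    values  : array-like of any shape (e.g. a deductive output tensor);
              it is flattened so every entry contributes.
    n_bins  : number of equal-width buckets (the captions use 100).
    n_sigma : if given, keep only values within mean +/- n_sigma * std
              (assumption: the window is centred on the mean).
    Returns (density, edges) as produced by np.histogram.
    """
    flat = np.asarray(values, dtype=np.float64).ravel()
    value_range = None
    if n_sigma is not None:
        mu, sigma = flat.mean(), flat.std()
        value_range = (mu - n_sigma * sigma, mu + n_sigma * sigma)
    # density=True normalizes so that sum(density * bin_width) == 1
    # over the values that fall inside value_range.
    density, edges = np.histogram(
        flat, bins=n_bins, range=value_range, density=True
    )
    return density, edges

# Usage with synthetic data standing in for a model output tensor.
rng = np.random.default_rng(0)
sample = rng.normal(size=(4, 50, 50))
density, edges = binned_density(sample, n_bins=100, n_sigma=5)
```

For the log-linear panels, the same `density` array would simply be drawn with a logarithmic y-axis; empty buckets then have to be masked or dropped, since their density is zero.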