Scalable Complexity Control Facilitates Reasoning Ability of LLMs

Liangkai Hang, Junjie Yao, Zhiwei Bai, Tianyi Chen, Yang Chen, Rongjie Diao, Hezhou Li, Pengxiao Lin, Zhiwei Wang, Cheng Xu, Zhongwang Zhang, Zhangchen Zhou, Zhiyu Li, Zehao Lin, Kai Chen, Feiyu Xiong, Yaoyu Zhang, Weinan E, Hongkang Yang, Zhi-Qin John Xu

TL;DR

This work investigates how to enhance LLM reasoning and generalization by controlling model complexity through two scalable hyperparameters: the initialization rate $\gamma$ and weight decay $\lambda$. It demonstrates that maintaining a constant initialization rate across width (a form of complexity control) yields faster improvements in scaling laws with respect to both model size and data size, and delivers substantial gains on reasoning benchmarks. Through empirical results on 0.9B and 2.4B parameter models, plus analyses of embedding spaces and attention matrices, the study shows that smaller effective complexity promotes deeper, more generalizable reasoning (evidenced by condensation and low-rank attention) and that larger regularization ($\lambda$) stabilizes training at scale. Overall, complexity control emerges as a practical and principled approach to advance the reasoning capabilities of LLMs during pretraining.

Abstract

The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Rather than fixing the initialization standard deviation (std), we find that holding the initialization rate (the exponent of the std) constant enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.
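
To make the distinction between a fixed std and a constant initialization rate concrete, the following is a minimal sketch. It assumes that the std of a zero-mean Gaussian initialization scales as `fan_in ** (-gamma)`, which is one reading of "the exponent of std"; the paper's own definition of $\gamma$ is given in its complexity-control section and may differ in detail. The function names `init_std`, `init_weights`, and `implied_gamma` are hypothetical and chosen only for illustration.

```python
import math
import random


def init_std(fan_in: int, gamma: float) -> float:
    """Std of a zero-mean Gaussian init, ASSUMING std = fan_in ** (-gamma)."""
    return fan_in ** (-gamma)


def init_weights(fan_in: int, fan_out: int, gamma: float, seed: int = 0):
    """Sample a fan_out x fan_in weight matrix (list of rows) with that std."""
    rng = random.Random(seed)
    std = init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def implied_gamma(fan_in: int, std: float) -> float:
    """Exponent gamma that a given std corresponds to at width fan_in."""
    return -math.log(std) / math.log(fan_in)


if __name__ == "__main__":
    fixed_std = 0.02  # illustrative value only, not taken from the paper
    for width in (256, 1024, 4096):
        # Constant rate: the std shrinks with width while gamma stays at 0.5.
        # Fixed std: the implied gamma changes as the width grows.
        print(width, init_std(width, 0.5), implied_gamma(width, fixed_std))
```

Under this assumption, keeping the std fixed while widening the model changes the implied $\gamma$ with width, whereas a constant $\gamma$ rescales the std with width. Weight decay $\lambda$ is not modeled in this sketch.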

Paper Structure

This paper contains 34 sections, 8 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Next-token prediction accuracy with varying model complexity. The colors indicate the probability of each token predicted by the models. Model complexity decreases from left to right, as the initialization rate takes the values $\gamma = 0.1, 0.5, 1$ (see Section \ref{sec:cc} for the definition of $\gamma$). These models have the same shape (180M parameters) and are trained on the same dataset (40B tokens).
  • Figure 2: Test loss across varying data and model scales under different complexity configurations. Left: Test loss progression for 0.8B-parameter models trained with data scales ranging from 0.2B to 1.4B tokens. Right: Test loss versus model parameter counts (50M-0.8B) with fixed 1B training tokens. Line colors correspond to different complexity configurations.
  • Figure 3: Performance improvement via SFT across complexity configurations (0.9B model). The improvement is quantified as the performance gap between the SFT model and the corresponding base model.
  • Figure 4: Parameter norm evolution across complexity configurations. Left to right: $\lambda = 0, 0.1, 1$. Line colors correspond to $\gamma = 0.1, 0.3, 0.5, 0.8, 1$.
  • Figure 5: (A) Evaluation scores (average, GSM8K, HellaSwag) under varying complexities. Top: Performance landscape across the $\gamma$-$\lambda$ plane, with color indicating score (dark: low, light: high). Bottom: Score-complexity relationships, with points indicating the models and dashed lines denoting baseline performance levels. (B) Task-specific Spearman correlations between model complexity and task score. Stronger negative correlations (approaching -1) indicate greater performance enhancement through complexity control.
  • ...and 11 more figures