Scalable Complexity Control Facilitates Reasoning Ability of LLMs

Liangkai Hang; Junjie Yao; Zhiwei Bai; Tianyi Chen; Yang Chen; Rongjie Diao; Hezhou Li; Pengxiao Lin; Zhiwei Wang; Cheng Xu; Zhongwang Zhang; Zhangchen Zhou; Zhiyu Li; Zehao Lin; Kai Chen; Feiyu Xiong; Yaoyu Zhang; Weinan E; Hongkang Yang; Zhi-Qin John Xu

TL;DR

This work investigates how to enhance LLM reasoning and generalization by controlling model complexity through two scalable hyperparameters: the initialization rate $\gamma$ and the weight decay coefficient $\lambda$. It shows that keeping the initialization rate constant as width grows, rather than fixing the initialization std, makes the scaling law descend faster in both model size and data size and delivers substantial gains on reasoning benchmarks. Through empirical results on 0.9B and 2.4B parameter models, plus analyses of embedding spaces and attention matrices, the study shows that smaller effective complexity promotes deeper, more generalizable reasoning (evidenced by condensation and low-rank attention) and that a larger weight decay $\lambda$ stabilizes training at scale. Overall, complexity control emerges as a practical and principled approach to advancing the reasoning capabilities of LLMs during pretraining.

Abstract

The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we find that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.
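To make the two hyperparameters concrete, here is a minimal sketch in plain Python. It rests on assumptions that the text above does not spell out: the initialization std is taken to be `d_in ** (-gamma)`, where `d_in` is the input width of a weight matrix, and weight decay is taken to be the decoupled (AdamW-style) shrinkage `w <- w * (1 - lr * weight_decay)`. The function names and the fixed std value `0.02` are hypothetical and chosen only for illustration.

```python
import math
import random


def init_std(d_in, gamma):
    """Initial weight std under the ASSUMED rule std = d_in ** (-gamma)."""
    return d_in ** (-gamma)


def init_weight(d_in, d_out, gamma, seed=0):
    """Sample a d_in x d_out matrix from N(0, std^2) with std = init_std(d_in, gamma)."""
    rng = random.Random(seed)
    std = init_std(d_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(d_out)] for _ in range(d_in)]


def decoupled_weight_decay(weights, lr, weight_decay):
    """Apply only the shrinkage part of a decoupled weight-decay step (ASSUMED form)."""
    factor = 1.0 - lr * weight_decay
    return [[w * factor for w in row] for row in weights]


def empirical_std(weights):
    """Population standard deviation over all entries of a nested-list matrix."""
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    return math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))


if __name__ == "__main__":
    fixed_std = 0.02  # hypothetical fixed-std baseline, for contrast only
    for d_in in (256, 1024, 4096):
        # With a constant gamma, the std shrinks as the width grows;
        # with a fixed std, it stays the same at every width.
        print(d_in, fixed_std, init_std(d_in, 0.5), init_std(d_in, 1.0))
```

Under this assumed rule, a larger `gamma` gives a smaller initial scale at any given width, and holding `gamma` constant ties the std to the width instead of leaving it at one fixed value across model sizes.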
