Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time
Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen
TL;DR
This work challenges the assumption that larger pretrained models always reason better by showing a U-shaped relation between model size and implicit reasoning performance in a controlled knowledge-graph setting. Using synthetic data generation and controlled ablations, the authors introduce graph search entropy $H(G)$ as a measure of reasoning complexity and establish an empirical scaling law linking optimal model size to $H(G)$, suggesting that about $0.008$ bits of information can be effectively processed per parameter for reasoning. The key contribution is the demonstration that optimal reasoning capability during pretraining is governed by the combinatorial structure of the knowledge graph rather than by sheer parameter count or training steps, with real-world data (FB15K-237) supporting the predictive utility of the law. The findings offer practical insights for pretraining design and graph-based data construction, highlight a boundary beyond which memorization overtakes reasoning, and indicate directions for future validation on large-scale corpora.
Abstract
Reasoning is an integral part of many tasks performed by language models (LMs). However, the effects of scaling model size and data on reasoning abilities at pretraining time remain understudied. To rigorously investigate this problem, we pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining. Interestingly, we observe that overparameterization can impair implicit reasoning performance due to excessive memorization. We investigate different factors that affect the loss curve when scaling different components of the knowledge graph, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law showing that optimal-sized LMs can approximately reason over 0.008 bits of information per parameter. This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in LLMs.
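To make the "0.008 bits per parameter" figure concrete, the sketch below turns a graph search entropy value into a rough optimal parameter count. It assumes the relation is a simple proportionality (optimal size is roughly entropy divided by 0.008), which is one reading of the abstract and not necessarily the exact functional form fitted in the paper. The function name `estimate_optimal_params` and the example entropy value are hypothetical, chosen only for illustration.

```python
# Rough sketch: convert graph search entropy (in bits) into an estimated
# optimal parameter count, assuming a purely linear relation.
# ASSUMPTION: the constant is the ~0.008 bits/parameter figure quoted in the
# abstract; the paper's fitted law may include terms not modeled here.

BITS_PER_PARAM = 0.008


def estimate_optimal_params(entropy_bits: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Return entropy_bits / bits_per_param as an estimated parameter count."""
    if entropy_bits < 0:
        raise ValueError("entropy_bits must be non-negative")
    if bits_per_param <= 0:
        raise ValueError("bits_per_param must be positive")
    return entropy_bits / bits_per_param


if __name__ == "__main__":
    # Hypothetical entropy value, not taken from the paper.
    example_entropy_bits = 8_000.0
    estimate = estimate_optimal_params(example_entropy_bits)
    print(f"Estimated optimal size: {estimate:,.0f} parameters")
```

Under this assumption, a knowledge graph with a search entropy of 8,000 bits would call for a model of about one million parameters. Because the abstract reports a U-shaped effect of model size, models well below or well above such an estimate would be expected to perform worse on the same graph.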
