Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

TL;DR

This work challenges the assumption that bigger pretraining models always improve reasoning by showing a U-shaped relation between model size and implicit reasoning performance in a controlled knowledge-graph setting. Through synthetic data generation and controlled ablations, the authors introduce graph search entropy $H(G)$ as a measure of reasoning complexity and establish an empirical scaling law linking optimal model size to $H(G)$, suggesting about $0.008$ bits of information can be effectively processed per parameter for reasoning. The key contribution is the demonstration that optimal reasoning capability during pretraining is governed by the combinatorial structure of the knowledge graph rather than sheer parameter count or training steps, with real-world data (FB15K-237) supporting the predictive utility of the law. The findings offer practical insights for pretraining design and graph-based data construction, highlighting a boundary where memorization overtakes reasoning and indicating directions for future validation on large-scale corpora.

Abstract

Reasoning is an integral part of many tasks performed by language models (LMs). However, the effects of scaling model sizes and data on reasoning abilities at pretraining time remain understudied. To rigorously investigate this problem, we pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining. Interestingly, we observe that overparameterization can impair implicit reasoning performance due to excessive memorization. We investigate different factors that affect the loss curve when scaling different components of the knowledge graph, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law showing that optimal-sized LMs can reason over approximately 0.008 bits of information per parameter. This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in LLMs.
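As a rough illustration of how the quoted figure could be used, the sketch below turns a graph search entropy value (in bits) into a back-of-the-envelope estimate of the optimal parameter count. It assumes a purely proportional relation with no intercept, which is a simplification made here and not necessarily the exact fitted law; the function name `estimate_optimal_params` and the constant `BITS_PER_PARAM` are hypothetical, and computing $H(G)$ itself is not covered.

```python
# Back-of-the-envelope sketch, NOT the authors' fitted regression.
# Assumption: optimal size is proportional to graph search entropy,
# with the ~0.008 bits-per-parameter figure quoted in the abstract
# and no intercept term.

BITS_PER_PARAM = 0.008  # approximate value quoted in the abstract


def estimate_optimal_params(
    graph_search_entropy_bits: float,
    bits_per_param: float = BITS_PER_PARAM,
) -> float:
    """Estimate the optimal parameter count for a knowledge graph.

    graph_search_entropy_bits: the graph search entropy H(G), in bits,
        assumed to have been computed elsewhere.
    bits_per_param: assumed reasoning capacity per parameter.
    """
    if graph_search_entropy_bits < 0:
        raise ValueError("graph search entropy must be non-negative")
    if bits_per_param <= 0:
        raise ValueError("bits_per_param must be positive")
    # Under the proportionality assumption: N_opt ~ H(G) / bits_per_param.
    return graph_search_entropy_bits / bits_per_param


if __name__ == "__main__":
    # Hypothetical entropy value, chosen only for illustration.
    entropy_bits = 8000.0
    print(f"Estimated optimal size: {estimate_optimal_params(entropy_bits):.3g} parameters")
```

Under these assumptions, a graph with 8,000 bits of search entropy would call for a model of about one million parameters. Per the abstract's observation, models much larger than such an estimate would be expected to lean toward memorization instead of reasoning.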

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The multiple-choice accuracy/loss on unseen triples of different-sized LMs trained on a real-world knowledge graph, FB15K-237. The first column shows that the testing accuracy decreases after a certain model size. The second column shows U-shaped loss curves of LMs trained with different numbers of steps. The third column shows that the training loss decreases steadily. These trends are stable across different ways of processing the knowledge triples, with the triple-only data showing the cleanest trend. Note that the model size on the x-axis is in log scale.
  • Figure 2: Nine possible node types generated by two logical rules. Each entity position in a rule would create a new entity type. Each relation shared between two rules would also create two new entity types.
  • Figure 3: We show the effect of different hyperparameters of the synthetic knowledge graph generation process. In each experiment, we keep all other parameters the same and only change one hyperparameter. We show the effect with both the testing accuracy (left) and the testing loss (right) as the y-axis, with different model sizes as the x-axis in log scale.
  • Figure 4: The optimal model size (the size achieving the lowest testing loss) vs. the graph search entropy. The red line is the linear regression line fitted to data from the synthetic experiments (blue squares), with a 95% confidence interval. We also plot the graph search entropy and optimal model size from the real-world FB15K-237 experiment (green dot) to verify the accuracy of the obtained linear scaling law.
  • Figure 5: An example of a triple being processed in three different ways.
  • ...and 2 more figures