Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni

TL;DR

This work theoretically analyzes transformers trained on an in-context weight prediction task for linear regression and reveals several key findings: balanced pretraining data can induce latent capabilities that are later activated during post-training, and SFT learns best from a small set of examples that are challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals.

Abstract

Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: (i) balanced pretraining data can induce latent capabilities later activated during post-training, and (ii) SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
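
To make the task in the abstract concrete, the following is a minimal sketch of what one in-context weight prediction prompt for linear regression could look like. It is not the authors' code: the function names `sample_prompt` and `ridge_estimate`, the Gaussian inputs, the noiseless labels, and the small dimensions are all illustrative assumptions. The diagonal task covariance borrows the form $\text{diag}(\rho 1_m, 1_{d-m})$ from the Figure 1 caption. A ridge estimator stands in for the transformer as a simple reference predictor of the weight vector.

```python
import numpy as np


def sample_prompt(n, d, cov_diag, noise_std=0.0, rng=None):
    """Sample one prompt: n pairs (x_i, y_i) generated by a hidden weight w.

    Hypothetical setup: w ~ N(0, diag(cov_diag)), x_i ~ N(0, I_d),
    y_i = <w, x_i> + noise. The prediction target is w itself.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=d) * np.sqrt(cov_diag)  # task weight vector
    X = rng.normal(size=(n, d))                 # in-context inputs
    y = X @ w + noise_std * rng.normal(size=n)  # in-context labels
    return X, y, w


def ridge_estimate(X, y, lam=1e-3):
    """Reference predictor (an assumption, not the paper's model): ridge regression."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


rng = np.random.default_rng(0)
d, m, rho = 8, 4, 0.1  # toy sizes, chosen only for illustration
cov_diag = np.concatenate([rho * np.ones(m), np.ones(d - m)])

X, y, w = sample_prompt(n=64, d=d, cov_diag=cov_diag, rng=rng)
w_hat = ridge_estimate(X, y)
err = float(np.sum((w_hat - w) ** 2))  # squared weight-prediction error
print(f"squared error of the ridge baseline: {err:.2e}")
```

In this sketch the prompt length `n` plays the role of the context length discussed in the abstract, and a batch of such prompts would correspond to the number of prompts $B$.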

Paper Structure

This paper contains 26 sections, 11 theorems, 163 equations, 5 figures.

Key Result

Theorem 4.1

Define We then have $\lim_{\lambda\to0^+} (\widetilde{V}_\lambda,\widetilde{W}_\lambda) = (\widetilde{V}_*,\widetilde{W}_*)$, where

Figures (5)

  • Figure 1: Post-test error as the number of prompts $B$ varies. Here, $d = 400, m= 200$ with different prompt lengths $(n)$. The pretraining covariance is $\Sigma_0 = \text{diag}(\rho 1_{m}, 1_{d-m})$, $\Delta = \text{diag}(1_m, 0_{n-m})$. The left panel shows the optimal SFT data allocation with covariance $A= \text{diag}( \eta (\rho+1) 1_m, 0_{n-m})$, with $\rho = 0.1$. The right panel shows the case where the SFT data distribution interferes with the pretraining distribution. Here, $A = \text{diag}( \eta (\rho+1) 1_m, r 1_{n-m})$, with $r = 0.01$.
  • Figure 2: Behavior of the post-test error as we vary the prompt length $n$, under the same setup as in Figure 1.
  • Figure 3: GPT-2 experiments: Test loss for (a)-(c) post-training with SFT, and (d)-(f) post-training with Outcome Supervision (OS). For SFT, there is a turning point beyond which a larger sample size ($B$) and context length ($n$) hurt performance. In contrast, for OS, larger $B$ and $n$ improve performance.
  • Figure 4: LSA experiments: Test loss for (a)-(c) post-training with SFT, and (d)-(f) post-training with Outcome Supervision (OS). For SFT, there is a turning point beyond which a larger sample size ($B$) and context length ($n$) hurt performance. In contrast, for OS, larger $B$ and $n$ improve performance.
  • Figure 5: Comparison between the theoretical prediction of the asymptotic error ${\sf Err}(\widetilde{V})$ and the simulation results for ${\sf Err}(\widetilde{V})$ and ${\sf Err}(\widetilde{V}_*)$. The theoretical prediction closely matches the simulation results. Here, $d = 600$, $m=300$, $n= 600$ (prompt size), $\rho=0.1$, $\eta = 0.2$, $r = 0.1$ (interference parameter). The simulations are averaged over 10 realizations.

Theorems & Definitions (12)

  • Theorem 4.1
  • Theorem 4.2
  • Remark 4.1
  • Proposition 4.3
  • Proposition 5.1
  • Proposition 6.1
  • Lemma A.1
  • Theorem B.1
  • Proposition B.2
  • Lemma C.1
  • ...and 2 more