Estimating the Effects of Sample Training Orders for Large Language Models without Retraining

Hao Yang, Haoxuan Li, Mengyue Yang, Xu Chen, Mingming Gong

TL;DR

FUT addresses the prohibitive cost of evaluating LLM performance under different training sample orders by approximating Adam updates with first- and second-order Taylor expansions and compressing the required information with Johnson-Lindenstrauss projections. It yields a retraining-free framework to estimate parameter trajectories for arbitrary sample orders, enabling curriculum design and memorization/generalization analyses. Empirical results on a 0.64B-parameter LLaMA-like model on WikiText-103 show that FUT closely reproduces retraining outcomes and offers substantial speedups when exploring curricula. The framework provides a practical tool for data-centric optimization and analysis of LLM training dynamics.

Abstract

The order of training samples plays a crucial role in large language models (LLMs), significantly impacting both their external performance and internal learning dynamics. Traditional methods for investigating this effect generally require retraining the model with various sample orders, which is computationally infeasible for LLMs. In this work, we improve on traditional methods by designing a retraining-free framework. By approximating Adam optimizer updates with first- and second-order Taylor expansions and using random projection methods to store intermediate checkpoints, our framework can efficiently estimate model parameters for arbitrary training sample orders. Next, we apply our framework to two downstream research problems: (1) Training curriculum design for LLMs -- we build on our retraining-free framework to propose a novel curriculum learning strategy that augments curriculum proposals with estimated model performance, enabling more informed sample scheduling. (2) LLMs' memorization and generalization effect analysis -- we use our retraining-free framework to estimate how the positions of training samples influence LLMs' capacity for memorization and generalization. We conduct extensive experiments to validate the effectiveness of our retraining-free framework in reproducing the true model performance, and further demonstrate its potential in optimizing LLM training curricula and analyzing the memorization and generalization effects of LLMs.

Paper Structure

This paper contains 32 sections, 1 theorem, 22 equations, 8 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Let $0 < \epsilon < 1$ and let $X = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^d$ be a set of $n$ vectors. Then there exists a linear mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^k$, where $k = \mathcal{O}(\epsilon^{-2} \log n)$, such that for all $x_i, x_j \in X$,

$$(1 - \epsilon)\,\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \epsilon)\,\|x_i - x_j\|^2.$$

Figures (8)

  • Figure 1: Overview of the FUT framework. FUT operates in three stages: Stage 1: Compute the reference trajectory $\Theta = \{\theta_t\}_{t=0}^T$ using a fixed data order $r$. Stage 2: Store update and gradient terms for all $(\theta_t, B_{l_t})$ pairs, compressing them via random projection. Stage 3: Estimate trajectories $\{\gamma_t^{k_i}\}_{t=0}^T$ under permuted data orders $\{k_i\}_{i=1}^N$ using first-order Taylor expansion based on stored terms. A toy example along the dashed line illustrates: ① retrieving stored terms for expansion, and ② updating parameters along a permuted order.
  • Figure 1: Estimation accuracy (AbsDiff) with different batch sizes.
  • Figure 2: Time cost comparison.
  • Figure 3: Memorization effects. Heatmaps in (a) and (b) are estimated by our FUT and FUT++ methods, respectively. Heatmap in (c) represents the true memorization effect obtained by retraining.
  • Figure 4: The generalization effect of batch $B_i$ on dataset $D$, with $\text{sim}(B_i,D) \geq \tau$.
  • ...and 3 more figures
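
The three stages in the FUT overview figure can be illustrated with a toy sketch. This is not the authors' implementation: it assumes plain SGD instead of Adam, a small least-squares problem instead of an LLM, and it skips the random-projection compression. All names (`batches`, `stored`, `estimate`, and so on) are hypothetical. Because the toy loss is quadratic, the first-order Taylor expansion of the gradient is exact here; for a real LLM it would only be an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_batches, batch_size, dim, lr = 4, 8, 3, 0.05

# Toy data (assumption): one (X, y) pair per batch for least-squares regression.
batches = [(rng.normal(size=(batch_size, dim)), rng.normal(size=batch_size))
           for _ in range(n_batches)]

def grad(theta, batch):
    """Gradient of the loss 0.5 * mean((X @ theta - y) ** 2)."""
    X, y = batch
    return X.T @ (X @ theta - y) / len(y)

def hessian(batch):
    """Hessian of the same loss; constant in theta for least squares."""
    X, _ = batch
    return X.T @ X / len(X)

def train(order, theta0):
    """Plain SGD over the batches in the given order; returns the trajectory."""
    traj = [theta0]
    for b in order:
        traj.append(traj[-1] - lr * grad(traj[-1], batches[b]))
    return traj

theta0 = np.zeros(dim)
ref_order = [0, 1, 2, 3]

# Stage 1: reference trajectory under a fixed data order.
ref_traj = train(ref_order, theta0)

# Stage 2: store gradient and Hessian terms for every (theta_t, batch) pair.
stored = {(t, b): (grad(ref_traj[t], batches[b]), hessian(batches[b]))
          for t in range(n_batches) for b in range(n_batches)}

def estimate(order):
    """Stage 3: follow a permuted order using only the stored terms."""
    gamma = [theta0]
    for t, b in enumerate(order):
        g, H = stored[(t, b)]
        # First-order Taylor expansion of the gradient around theta_t.
        g_hat = g + H @ (gamma[-1] - ref_traj[t])
        gamma.append(gamma[-1] - lr * g_hat)
    return gamma

new_order = [2, 0, 3, 1]
estimated = estimate(new_order)
retrained = train(new_order, theta0)  # ground truth, for comparison only
max_gap = max(np.abs(e - r).max() for e, r in zip(estimated, retrained))
print(f"max gap between estimate and retraining: {max_gap:.2e}")
```

Storing a full Hessian per pair is only feasible at toy scale, which is presumably why the framework compresses the stored terms with random projection.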

Theorems & Definitions (1)

  • Theorem 1: Johnson–Lindenstrauss Theorem
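
The Johnson–Lindenstrauss guarantee (a linear map to far fewer dimensions that roughly preserves pairwise distances) can be checked numerically. The sketch below is an illustration under assumptions, not the paper's compression code: it uses a scaled Gaussian matrix as the random map, and the sizes `n`, `d`, and `k` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 2000, 512  # number of points, original dim, projected dim

X = rng.normal(size=(n, d))
# Gaussian random projection, scaled so squared norms are preserved in expectation.
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all distinct pairs of rows."""
    diffs = A[:, None, :] - A[None, :, :]
    sq = (diffs ** 2).sum(axis=-1)
    upper = np.triu_indices(len(A), k=1)
    return sq[upper]

# Ratio of projected to original squared distance; close to 1 means low distortion.
ratios = pairwise_sq_dists(Y) / pairwise_sq_dists(X)
max_distortion = np.abs(ratios - 1).max()
print(f"max relative distortion over {ratios.size} pairs: {max_distortion:.3f}")
```

Increasing `k` tightens the distortion at the cost of more storage, which is the trade-off the bound $k = \mathcal{O}(\epsilon^{-2} \log n)$ describes.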