Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie

TL;DR

This work tackles the high cost of pretraining large language models by asking whether small, pretrained LLMs can catalyze the training of much larger models. It introduces Late-to-Early Training (LET), a paradigm with two mechanisms: late-to-early-step learning and late-to-early-layer learning, guided by a decaying projection loss that aligns early layers of the target model to late layers of a smaller teacher. Through extensive experiments on 1.4B and 7B parameter models trained on The Pile, LET delivers both faster convergence and stronger downstream performance, achieving up to 1.6× speedup while improving accuracy by about 5% on downstream tasks, even when the teacher has an order of magnitude fewer parameters. Ablation studies show that aligning the teacher’s final layer with the target’s early layers (L2E) and tuning the alignment weight to an optimal level around 0.1 yields robust gains, and LET-1.4B can surpass Baseline-3B in performance. Overall, LET provides a practical, resource-efficient pathway to leverage existing pretrained assets for building stronger, faster LLMs.

Abstract

As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e., late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
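
The core idea above (a language modeling loss plus a decaying alignment term that pulls a projected early-layer representation of the target model toward a late-layer representation of the small pretrained model) can be illustrated with a minimal sketch. This summary does not give the exact loss, so everything specific here is an assumption made for illustration: the linear decay schedule, the cosine distance as the alignment measure, the linear projection, the starting weight of 0.1 (chosen only because the ablation reports an optimum around that value), and all function and parameter names.

```python
import math

def decayed_weight(step, decay_steps, lam0=0.1):
    """Alignment weight at a given step.

    Assumption: linear decay from lam0 to 0 over decay_steps; the paper's
    actual schedule is not stated in this summary.
    """
    if step >= decay_steps:
        return 0.0
    return lam0 * (1.0 - step / decay_steps)

def project(hidden, weight):
    """Map a target-model hidden vector to the teacher's width.

    weight is a list of rows, one row per teacher dimension (hypothetical
    linear projection; in practice it would be learned).
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def cosine_distance(u, v):
    """1 - cosine similarity; an assumed choice of alignment measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm, 1e-12)

def let_style_loss(lm_loss, target_early, teacher_late, proj_weight,
                   step, decay_steps, lam0=0.1):
    """Language modeling loss plus a decaying early-to-late alignment term."""
    lam = decayed_weight(step, decay_steps, lam0)
    align = cosine_distance(project(target_early, proj_weight), teacher_late)
    return lm_loss + lam * align

# Toy usage: a 3-dimensional target state projected to a 2-dimensional teacher state.
proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
early_loss = let_style_loss(2.0, [1.0, 2.0, 3.0], [-1.0, -2.0], proj,
                            step=0, decay_steps=100)
late_loss = let_style_loss(2.0, [1.0, 2.0, 3.0], [-1.0, -2.0], proj,
                           step=100, decay_steps=100)
```

In this sketch the alignment term matters only early in training: once `step` reaches `decay_steps` the weight is zero and the objective reduces to the plain language modeling loss, which mirrors the "late-to-early-step" idea of using the pretrained model's guidance during the early phase only.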

Paper Structure (44 sections, 19 equations, 13 figures, 6 tables, 1 algorithm)

Figures (13)

  • Figure 1: Comparison of Average Downstream Task Performance: LET vs. Baseline (Standard Training) on 1.4B and 7B Models. LET models are trained under our proposed LET paradigm, whereas the baseline models utilize standard causal language modeling. Remarkably, LET delivers significant performance gains, even when aligned with a model 10× smaller than the target model.
  • Figure 2: Language modeling performance of LET across three different vocabulary settings. We evaluate the perplexity of models trained with different vocabularies: SmolLM, OPT, and Pythia. For a fair comparison (Gao et al., 2020), each subplot uses the same vocabulary. The results demonstrate that LET consistently achieves lower perplexity across all three settings.
  • Figure 3: Comparison of six layer-wise alignment strategies on average downstream task performance in one-shot evaluation. The proposed LET paradigm, corresponding to L2E, achieves the highest average performance across all downstream tasks, outperforming all alternative strategies.
  • Figure 4: Comparison of six layer-wise alignment strategies on language modeling performance, measured as test perplexity on the test split of The Pile dataset. Both M2E and L2E maintain robust performance throughout training, with L2E yielding the lowest final perplexity among all strategies.
  • Figure 5: Average downstream task performance (left) and test perplexity on The Pile dataset (right) evaluated under different λ values: 0.01, 0.1, 0.3, 1.0, and 3.0. "Baseline" denotes training with standard causal language modeling, whereas all other configurations employ the proposed LET paradigm with different values of λ.
  • ...and 8 more figures