Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
TL;DR
This work tackles the high cost of pretraining large language models by asking whether small pretrained LLMs can accelerate the training of much larger models. It introduces Late-to-Early Training (LET), a paradigm with two mechanisms: late-to-early-step learning and late-to-early-layer learning, guided by a decaying projection loss that aligns early layers of the target model to late layers of a smaller teacher. In extensive experiments on 1.4B and 7B parameter models trained on The Pile, LET delivers both faster convergence and stronger downstream performance, achieving up to 1.6× speedup while improving downstream task accuracy by about 5%, even when the teacher has an order of magnitude fewer parameters. Ablation studies show that aligning the teacher’s final layer with the target’s early layers (L2E) and setting the alignment weight near its optimum of about 0.1 yield robust gains, and LET-1.4B can surpass Baseline-3B in performance. Overall, LET provides a practical, resource-efficient way to leverage existing pretrained models for building stronger LLMs faster.
Abstract
As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
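To make the core idea concrete, the following is a minimal sketch of how a decaying alignment term could be combined with the language modeling loss. It is not the authors' implementation. The summary above states only that a decaying projection loss aligns the target's early layers with the teacher's late layers, and that an alignment weight around 0.1 works well. The cosine distance, the linear decay schedule, and all function and parameter names (`alignment_weight`, `project`, `let_loss`, `decay_steps`) are assumptions made for illustration.

```python
import math


def alignment_weight(step, decay_steps, base_weight=0.1):
    """Alignment weight at a training step.

    Assumption: linear decay from base_weight to 0 over decay_steps.
    The source says the loss decays but does not give the schedule here.
    """
    if step >= decay_steps:
        return 0.0
    return base_weight * (1.0 - step / decay_steps)


def project(hidden, proj):
    """Map one target hidden vector (d_target) into teacher space (d_teacher).

    proj is a d_teacher x d_target matrix stored as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in proj]


def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity (assumed choice of alignment distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def let_loss(lm_loss, target_early, teacher_late, proj, step, decay_steps):
    """Language modeling loss plus a decaying late-to-early alignment term.

    target_early: per-token hidden vectors from an early layer of the target.
    teacher_late: per-token hidden vectors from a late layer of the teacher.
    """
    weight = alignment_weight(step, decay_steps)
    if weight == 0.0:
        # After the decay window, training reduces to the standard objective.
        return lm_loss
    per_token = [
        cosine_distance(project(h, proj), t)
        for h, t in zip(target_early, teacher_late)
    ]
    return lm_loss + weight * sum(per_token) / len(per_token)
```

In this sketch the teacher signal matters most at the start of training and vanishes after `decay_steps`, which mirrors the late-to-early-step idea: the target model is guided early and then trained with the standard objective alone. In a real training loop the projection would be a learned layer and the teacher would be frozen; both details are assumptions here as well.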
