A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models

Zhihao Wang, Shiyu Liu, Jianheng Huang, Zheng Wang, Yixuan Liao, Xiaoxin Chen, Junfeng Yao, Jinsong Su

TL;DR

This work finds that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs, and proposes a learning rate path switching training paradigm.

Abstract

Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost. Moreover, their performance and training cost gaps widen progressively with version updates. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and continual pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning rate path switching training paradigm. Our paradigm comprises one main path, where we pre-train an LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly added training data. Extensive experiments demonstrate the effectiveness and generalization of our paradigm. In particular, when training four versions of LLMs, our paradigm reduces the total training cost to 58% of that of PTFS, while maintaining comparable pre-training performance.
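The main-path and branching-path learning rates described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear warm-up, the cosine shape of the branch decay, and all numeric values (`MAX_LR`, `MIN_LR`, `WARMUP_STEPS`, the step counts) are assumptions chosen for the example.

```python
import math

# Hypothetical hyperparameters, chosen only for illustration.
MAX_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2_000


def main_path_lr(step: int) -> float:
    """Main path: linear warm-up, then hold the maximal learning rate."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    return MAX_LR


def branch_path_lr(step: int, branch_start: int, branch_steps: int) -> float:
    """Branching path: a complete cosine decay from MAX_LR to MIN_LR.

    The branch starts from a main-path checkpoint at `branch_start` and
    finishes its decay after `branch_steps` further steps.
    """
    progress = min(max(step - branch_start, 0) / branch_steps, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# One version update: branch off the main path at step 10,000 and
# decay over 2,000 steps (both numbers are assumptions).
branch_start, branch_steps = 10_000, 2_000
lr_at_branch_start = branch_path_lr(branch_start, branch_start, branch_steps)
lr_at_branch_end = branch_path_lr(branch_start + branch_steps, branch_start, branch_steps)
```

Under these assumptions, each new version would resume the main path from its latest checkpoint at the maximal learning rate, then spawn its own branch, so only the branch performs the decay.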

Paper Structure

This paper contains 32 sections, 1 equation, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: The learning rate curves of the cosine learning rate schedule under PTFS, CPT, and our paradigm, all of which are used to update four versions of LLMs. Here, differently colored curves represent different version updates of LLMs.
  • Figure 2: The comparison of different training paradigms. "APPL" ($\downarrow$) denotes the average perplexity of LLMs across different versions, and "Relative Cost" ($\downarrow$) is the ratio of each paradigm's total training steps across different versions to the total training steps of PTFS. The lower left corner achieves the best trade-off.
  • Figure 3: The learning rate curves of the cosine, Knee, and multi-step learning rate schedules.
  • Figure 4: The effect of learning rate adjustment in the first stage. In the first stage, we vary the cosine cycle length as 10K, 20K, 30K, 40K, and $+\infty$ steps, respectively, where the checkpoints at the 10K-th step are selected as the initialization checkpoints for the subsequent 10K-step continual pre-training. "($\cdot$,$\cdot$)" indicates the PPLs of the initialization checkpoint and the corresponding updated LLM.
  • Figure 5: The effect of learning rate adjustment in the second stage. In the first stage, we directly use the maximal learning rate after warm-up. During the second stage, we try cosine cycle lengths of 10K, 20K, 30K, 40K, and $+\infty$ steps, respectively, where the PPLs of LLMs at the 20K-th step are compared.
  • ...and 1 more figure
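To make the Figure 4 setup concrete, the sketch below computes the learning rate that a first-stage cosine schedule has reached at the 10K-th step for each of the listed cycle lengths. It is an illustrative assumption, not the paper's code: warm-up is ignored, and the normalized `max_lr=1.0` and `min_lr=0.1` values are hypothetical.

```python
import math


def cosine_lr(step: int, cycle_steps: float, max_lr: float, min_lr: float) -> float:
    """Cosine decay from max_lr to min_lr over cycle_steps (warm-up ignored).

    An infinite cycle length never decays, i.e. the rate stays at max_lr.
    """
    if math.isinf(cycle_steps):
        return max_lr
    progress = min(step / cycle_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Cycle lengths from the Figure 4 caption; the checkpoint is taken at step 10K.
cycle_lengths = [10_000, 20_000, 30_000, 40_000, math.inf]
lr_at_checkpoint = [cosine_lr(10_000, c, max_lr=1.0, min_lr=0.1) for c in cycle_lengths]

for cycle, lr in zip(cycle_lengths, lr_at_checkpoint):
    print(f"cycle={cycle}: lr at step 10K = {lr:.3f}")
```

A longer cycle leaves the initialization checkpoint at a higher learning rate, with the infinite cycle corresponding to holding the maximal learning rate throughout the first stage.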