A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar

TL;DR

This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM), and develops a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs.

Abstract

A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
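
The soft-label supervision described in the abstract can be illustrated with a small NumPy sketch. It mixes the usual next-token cross-entropy with a cross-entropy against the SLM's predictive distribution. The function names and the fixed mixing weight `omega` are assumptions made for illustration, not the authors' implementation; the paper's exact loss and weighting may differ.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kd_token_loss(llm_logits, slm_logits, targets, omega=0.5):
    """Per-token mix of hard-label and soft-label cross-entropy.

    llm_logits, slm_logits: arrays of shape (T, V)
    targets: integer array of shape (T,) holding ground-truth next tokens
    omega: hypothetical weight on the SLM soft labels, in [0, 1]
    """
    llm_logp = log_softmax(llm_logits)
    slm_prob = np.exp(log_softmax(slm_logits))
    # Standard next-token cross-entropy against the one-hot label.
    hard = -llm_logp[np.arange(len(targets)), targets]
    # Cross-entropy against the SLM's predictive distribution (soft labels).
    soft = -(slm_prob * llm_logp).sum(axis=-1)
    return (1.0 - omega) * hard + omega * soft

# Toy usage with random logits (T = 4 positions, V = 10 vocabulary entries).
rng = np.random.default_rng(0)
T, V = 4, 10
llm_logits = rng.normal(size=(T, V))
slm_logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
mean_loss = kd_token_loss(llm_logits, slm_logits, targets, omega=0.5).mean()
```

Setting `omega=0.0` recovers standard next-token training, while larger values lean more heavily on the SLM. This is one simple way to picture the bias-variance balance the abstract mentions.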

Paper Structure

This paper contains 29 sections, 6 theorems, 62 equations, 3 figures, 11 tables, 1 algorithm.

Key Result

Theorem 3.3

Let $\hat{\bm{\theta}}$ and $\bm{\theta}^{*}$ be as defined in eq:kd-erm-theta. Define $f^{\bm{\theta}}: \mathcal{V}^T \to [0, M]$ by $f^{\bm{\theta}}(\mathbf{x}) = \ell^\omega(\mathbf{x}; \bm{\theta})$ for all $\mathbf{x} \in \mathcal{V}^T$ and $\bm{\theta} \in \Theta$. Then, under Assumptio ... where $\mathsf{D}_{\rm TV}$ is the total variation (TV) distance, $V_N(f^{\bm{\theta}}) = \frac{1}{N(N-1)}\sum_{1 \leq \cdots}$ ...

Figures (3)

  • Figure 1: An overview of small model aided large model training (SALT) pre-training. SALT utilizes an SLM in two ways to improve the pre-training of LLM: ① To perform KD with SLM as teacher in the early phase of LLM pre-training; and ② To obtain a valuable subset of pre-training corpora to be utilized during the KD.
  • Figure 2: Fraction of correct next-token predictions for various LMs during training, on a subset of the Pile training set.
  • Figure 3: Log perplexity for different models during their pre-training, as measured on a subset of the Pile training set.
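
The two roles of the SLM in the Figure 1 caption (KD in the early phase, and choosing a subset of the corpus for that phase) can be sketched as below. This is only an illustrative outline: `distillation_weight`, `select_subset`, `n_kd_steps`, and the "keep the top-k scores" rule are hypothetical names and choices, and the scoring rule the paper actually uses to pick "informative" and "hard" examples is not given in this summary.

```python
import numpy as np

def distillation_weight(step, n_kd_steps, omega=0.5):
    """Weight on SLM soft labels: `omega` in the early KD phase, 0 afterwards."""
    return omega if step < n_kd_steps else 0.0

def select_subset(slm_scores, k):
    """Indices of the k highest-scoring sequences.

    `slm_scores` stands in for some SLM-derived score per sequence;
    the concrete scoring rule is an assumption left open here.
    """
    order = np.argsort(-np.asarray(slm_scores), kind="stable")
    return order[:k]

# Toy usage: 5 training steps with a 3-step KD phase, 4 candidate sequences.
schedule = [distillation_weight(s, n_kd_steps=3) for s in range(5)]
chosen = select_subset([0.2, 1.5, 0.9, 1.1], k=2)
print(schedule)         # [0.5, 0.5, 0.5, 0.0, 0.0]
print(chosen.tolist())  # [1, 3]
```

In a full training loop, the weight returned for each step would scale the soft-label term of the loss, and the selected indices would determine which sequences are fed to the LLM during the KD phase.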

Theorems & Definitions (14)

  • Remark 3.2
  • Theorem 3.3: Informal
  • Theorem 3.5
  • Remark 3.6: Dependence of $C,\{V_t\}$ on $T$
  • Remark 3.7
  • Proposition B.2
  • Proof of Proposition \ref{prop:esr}
  • Theorem B.3: Formal version of Theorem \ref{thm:esr}
  • Proof of Theorem \ref{thm:esr-appen}
  • Lemma B.4
  • ...and 4 more