Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model

Habib Hajimolahoseini, Mohammad Hassanpour, Foozhan Ataiefard, Boxing Chen, Yang Liu

TL;DR

The paper addresses the high cost of scaling large language models by introducing Progressive Low Rank Decomposition (PLRD), a method that progressively factorizes transformer weight matrices to obtain smaller models from a single pre-trained foundation without retraining from scratch. By expressing weight matrices as $W' = W_0 W_1$ and applying iterative, rank-reduced decompositions with continual pretraining between steps, PLRD yields a continuum of model sizes (e.g., 3.1B, 3.3B) from base models like Mistral-v0.1-7B and LLaMa2-7B while using only about 1B tokens for training. Empirical results show PLRD achieves comparable zero-shot performance to models trained from scratch on benchmarks such as LogiQA, BoolQ, MMLU, and WinoGrande, with CPU inference speeds within ~3% of baseline models. The work implies PLRD can democratize access to efficient LLMs by enabling multiple sizes from a single foundation with minimal additional training and without bespoke training recipes, though further validation on larger architectures is needed.

Abstract

This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decomposed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with the PLRD method on only 1B tokens achieve performance comparable to traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
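
To make the factorization $W' = W_0 W_1$ concrete, here is a minimal NumPy sketch of a single low-rank decomposition step. It assumes a truncated SVD is used to obtain the factors; this section does not state which decomposition the authors use, so treat that choice, the helper names (`low_rank_factorize`, `factorized_param_count`), and the toy matrix size as illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Split W (d_out x d_in) into W0 (d_out x rank) and W1 (rank x d_in).

    Assumption: truncated SVD, with singular values split evenly
    between the two factors. The paper may use a different scheme.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    W0 = U[:, :rank] * sqrt_s           # scale each kept column of U
    W1 = sqrt_s[:, None] * Vt[:rank]    # scale each kept row of V^T
    return W0, W1

def factorized_param_count(d_out, d_in, rank):
    """Parameters stored by the two factors instead of the dense matrix."""
    return rank * (d_out + d_in)

# Toy stand-in for one transformer weight matrix (hypothetical size).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))

W0, W1 = low_rank_factorize(W, rank=16)
W_approx = W0 @ W1

dense_params = W.size
low_rank_params = factorized_param_count(64, 48, 16)
rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

The factors save parameters only when the rank satisfies $r < d_{out} d_{in} / (d_{out} + d_{in})$. A random matrix like the one above has no low-rank structure, so its approximation error is large; the method relies on continual pretraining after each step to recover accuracy.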

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 9 tables, 1 algorithm.

Figures (2)

  • Figure 1: Training Efficiency Comparison of 3B Models. The PLRD-Mistral-3.1B and PLRD-LLaMa2-3.3B models achieve similar average performance on downstream benchmarks while being trained on 0.1% of the number of tokens used in pre-training from scratch.
  • Figure 2: Progressive Low-rank Decomposition of Mistral-v0.1 (left) and LLaMa2 (right). This figure shows the steps of compression and training for each of these two models. In each compression step, the accuracy of the model was recovered with continual pretraining on 250M tokens. The final models are trained on 1B tokens in total.
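
The compress-then-recover loop described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions: the rank schedule and matrix size are hypothetical, truncated SVD stands in for whatever decomposition the authors use, `continual_pretrain` is a placeholder that does no training, and the count of four steps is inferred from 1B total tokens at 250M per step rather than stated in the caption.

```python
import numpy as np

TOKENS_PER_STEP = 250_000_000   # per-step budget from the Figure 2 caption
TOTAL_TOKENS = 1_000_000_000    # total budget from the Figure 2 caption

def truncate_factors(W0, W1, new_rank):
    """Re-factorize the product W0 @ W1 at a lower rank (assumed: truncated SVD)."""
    U, S, Vt = np.linalg.svd(W0 @ W1, full_matrices=False)
    sqrt_s = np.sqrt(S[:new_rank])
    return U[:, :new_rank] * sqrt_s, sqrt_s[:, None] * Vt[:new_rank]

def continual_pretrain(W0, W1, num_tokens):
    """Placeholder: a real run would update the factors by training on num_tokens tokens."""
    return W0, W1

def progressive_decompose(W, ranks):
    """Alternate rank reduction and recovery training over a decreasing rank schedule."""
    W0, W1 = W, np.eye(W.shape[1])   # trivial full-rank starting factorization
    tokens_used = 0
    history = []                     # (rank, parameter count, cumulative tokens)
    for r in ranks:
        W0, W1 = truncate_factors(W0, W1, r)
        W0, W1 = continual_pretrain(W0, W1, TOKENS_PER_STEP)
        tokens_used += TOKENS_PER_STEP
        history.append((r, W0.size + W1.size, tokens_used))
    return W0, W1, history

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))    # toy stand-in for one weight matrix
ranks = [24, 20, 16, 12]             # hypothetical schedule, one entry per step
W0, W1, history = progressive_decompose(W, ranks)
```

Each entry of `history` corresponds to one intermediate family member, so a single run over the schedule yields a model at every rank along the way.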