LESA: Learnable LLM Layer Scaling-Up

Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao

TL;DR

LESA addresses the prohibitive cost of training massive LLMs by replacing heuristic depth-scaling with a learnable approach. It uses SVD to reveal inter-layer patterns and trains a neural predictor to generate intermediate layers that are inserted between adjacent layers, yielding better initialization and faster convergence during continual pre-training. Empirical results show LESA outperforms baselines like LLaMA Pro and SOLAR across model sizes, domains, and knowledge tasks, with lower training cost. The approach offers a practical path to more efficient depth scaling-up and provides insights into inter-layer parameter patterns that could guide future model design.

Abstract

Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
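
To make the idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It concatenates one weight matrix per layer, applies SVD to obtain a shared basis, and inserts a predicted layer between each adjacent pair. The function names, the toy shapes, the choice to insert between every pair, and above all the `midpoint` predictor are assumptions made for illustration: LESA trains a neural network for this step, and the midpoint rule here is only a stand-in that shows where such a predictor would plug in.

```python
import numpy as np

def layer_coefficients(weights):
    """Concatenate per-layer matrices and factor them with one SVD.

    weights: list of (d_out, d_in) arrays, one per layer (hypothetical layout).
    Returns the shared left basis U and one coefficient block per layer,
    such that U @ block reconstructs that layer's matrix.
    """
    d_in = weights[0].shape[1]
    stacked = np.concatenate(weights, axis=1)            # (d_out, L * d_in)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    coeffs = s[:, None] * vt                             # scale rows by singular values
    blocks = [coeffs[:, i * d_in:(i + 1) * d_in] for i in range(len(weights))]
    return u, blocks

def insert_between(u, blocks, predict):
    """Interleave one predicted layer between each adjacent pair of layers."""
    expanded = []
    for left, right in zip(blocks[:-1], blocks[1:]):
        expanded.append(u @ left)                        # original layer
        expanded.append(u @ predict(left, right))        # predicted intermediate layer
    expanded.append(u @ blocks[-1])                      # last original layer
    return expanded

def midpoint(left, right):
    # Placeholder predictor (assumption): NOT the trained network from the paper.
    return 0.5 * (left + right)

rng = np.random.default_rng(0)
layers = [rng.standard_normal((6, 4)) for _ in range(3)]  # toy stand-ins for real weights
u, blocks = layer_coefficients(layers)
expanded = insert_between(u, blocks, midpoint)
```

Because the basis `u` is shared across layers, averaging coefficient blocks is equivalent to averaging the original matrices, so the placeholder adds nothing over plain interpolation. A learned predictor is what would let the inserted layers follow the inter-layer pattern the abstract describes.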

Paper Structure

This paper contains 27 sections, 3 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Existing depth scaling-up methods can be categorized into two types: "Interpolation" and "Stack". LLaMA Pro and SOLAR can be seen as specific examples of these two types. Layers with the same color represent identical parameters, and the dashed boxes indicate those obtained through duplication.
  • Figure 2: The inter-layer continuity pattern exhibited by the gate_proj matrix of Llama3-8B in the SVD space. The numbers represent the layer indices.
  • Figure 3: Overview of the proposed LESA. We first extract the weight matrices from the MLP and self-attention layers. Next, we apply SVD and train a neural network to predict the intermediate layers. Finally, we reconstruct the expanded LLM.
  • Figure 4: The continual pre-training loss curves of models expanded by different methods. LESA starts with a lower initial loss, indicating a better initialization. It stabilizes after 2k steps, reaching the same convergence level as LLaMA Pro after 5k steps, and converges much faster than SOLAR, achieving the same loss with less than half the training cost.
  • Figure 5: The continual pre-training loss curves without SVD, compared to the main experiment, show that with SVD, the model's initial loss and the final converged loss are both slightly lower.
  • ...and 3 more figures
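
The two heuristic families named in the Figure 1 caption can be illustrated by the layer orderings they produce. The sketch below is a simplified reading of "Interpolation" and "Stack", not a reproduction of LLaMA Pro or SOLAR: the function names, the `every` and `drop` parameters, and the 8-layer toy model are all assumptions. A repeated index marks a duplicated layer, corresponding to the dashed boxes in the figure.

```python
def interpolate_layers(n_layers, every):
    """Interpolation-style ordering: after each group of `every` layers,
    insert a duplicate of the group's last layer."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if (i + 1) % every == 0:
            order.append(i)          # duplicated layer placed right after its source
    return order

def stack_layers(n_layers, drop):
    """Stack-style ordering: the first (n_layers - drop) layers of one copy
    followed by the last (n_layers - drop) layers of a second copy."""
    head = list(range(n_layers - drop))
    tail = list(range(drop, n_layers))
    return head + tail

print(interpolate_layers(8, 4))  # [0, 1, 2, 3, 3, 4, 5, 6, 7, 7]
print(stack_layers(8, 2))        # [0, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7]
```

In both orderings every added layer is an exact copy of an existing one, which is the heuristic duplication that LESA replaces with predicted parameters.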