LESA: Learnable LLM Layer Scaling-Up
Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao
TL;DR
LESA addresses the prohibitive cost of training massive LLMs by replacing heuristic depth-scaling with a learnable approach. It uses SVD to reveal inter-layer patterns and trains a neural predictor to generate intermediate layers that are inserted between adjacent layers, yielding better initialization and faster convergence during continual pre-training. Empirical results show LESA outperforms baselines like LLaMA Pro and SOLAR across model sizes, domains, and knowledge tasks, with lower training cost. The approach offers a practical path to more efficient depth scaling-up and provides insights into inter-layer parameter patterns that could guide future model design.
Abstract
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
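The pipeline described in the abstract (concatenate per-layer parameters, apply SVD, predict a layer to insert between each adjacent pair) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that same-shaped weight matrices are concatenated column-wise, and it swaps the neural predictor for a two-coefficient least-squares interpolator fitted on consecutive layer triples. The function names `decompose_layers`, `fit_interpolator`, and `expand_depth` are hypothetical.

```python
import numpy as np

def decompose_layers(weights):
    """Concatenate same-shaped layer matrices column-wise and apply SVD.

    Assumption: column-wise concatenation; the abstract does not state the axis.
    """
    stacked = np.concatenate(weights, axis=1)           # (d_out, L * d_in)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    # One (rank, d_in) coefficient block per layer, so W_i = (U * S) @ blocks[i].
    blocks = np.split(Vt, len(weights), axis=1)
    return U, S, blocks

def fit_interpolator(blocks):
    """Stand-in for the learned predictor (NOT a neural network).

    Fits scalars (a, b) by least squares so that
    blocks[i + 1] ~ a * blocks[i] + b * blocks[i + 2] over all triples.
    Requires at least three layers.
    """
    left = np.concatenate([b.ravel() for b in blocks[:-2]])
    mid = np.concatenate([b.ravel() for b in blocks[1:-1]])
    right = np.concatenate([b.ravel() for b in blocks[2:]])
    A = np.stack([left, right], axis=1)
    coef, *_ = np.linalg.lstsq(A, mid, rcond=None)
    return coef

def expand_depth(weights):
    """Insert one predicted layer between every pair of adjacent layers."""
    U, S, blocks = decompose_layers(weights)
    a, b = fit_interpolator(blocks)
    expanded = [weights[0]]
    for i in range(len(weights) - 1):
        new_block = a * blocks[i] + b * blocks[i + 1]   # predicted coefficients
        expanded.append((U * S) @ new_block)            # back to weight space
        expanded.append(weights[i + 1])                 # original layer kept as-is
    return expanded
```

Calling `expand_depth` on a list of L matrices returns 2L - 1 matrices, with the original layers unchanged at the even positions. In this sketch the result only serves as an initialization; the abstract states that the expanded model is then continually pre-trained.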
