EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models

Xingrun Xing, Zheng Liu, Shitao Xiao, Boyan Gao, Yiming Liang, Wanpeng Zhang, Haokun Lin, Guoqi Li, Jiajun Zhang

TL;DR

EfficientLLM tackles edge-language-model inefficiency by introducing pruning-aware pretraining, which continuously prunes minimal parameter groups during pretraining to create architecture-agnostic sub-networks. It combines saliency-driven architecture search with second-order weight updates to scale LLM pruning within pretraining, bridging the gap between post-training compression and direct pretraining. Experiments show that EfficientLLM achieves state-of-the-art or competitive performance for 100M–1B parameter models on common-sense and reasoning benchmarks, outperforming existing edge LLMs and several pruning baselines with less pretraining data. This approach provides a data-efficient, hardware-friendly path to practical edge LLMs and will be released as open-source software.

Abstract

Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Recently, increasing concerns about cloud costs, latency, and privacy have made it an urgent requirement to develop compact edge language models. In contrast to direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models. It has the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in LLMs and continuously optimize structural pruning, extending post-training pruning methods such as LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed using saliency-driven pruning, which is the first time an auto-designed architecture has exceeded SoTA human-designed LLMs in modern pretraining. We show that this yields top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary. EfficientLLM significantly outperforms SoTA baselines with $100M \sim 1B$ parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common-sense benchmarks. As a first attempt, EfficientLLM bridges the performance gap between traditional LLM compression and direct pretraining methods, and we will fully open-source it at https://github.com/Xingrun-Xing2/EfficientLLM.
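To make the idea of saliency-driven pruning over minimal parameter groups concrete, the following is a minimal sketch and not the authors' implementation. It assumes a first-order Taylor saliency, $|\sum_i g_i w_i|$ per group, in the style of post-training pruners; the abstract does not state the exact saliency formula. The names `taylor_saliency`, `prune_step`, and `prune_to_budget` are hypothetical, and gradients are treated as fixed inputs, whereas pruning-aware pretraining would recompute them and update the weights between pruning steps.

```python
from typing import Dict, List

# A "group" is a hypothetical stand-in for one minimal parameter group,
# e.g. all weights tied to a single hidden neuron or attention head.
Group = Dict[str, List[float]]  # {"w": weights, "g": gradients}


def taylor_saliency(group: Group) -> float:
    """First-order Taylor saliency of one group: |sum_i g_i * w_i| (an assumption)."""
    return abs(sum(w * g for w, g in zip(group["w"], group["g"])))


def prune_step(groups: Dict[str, Group]) -> str:
    """Remove the least salient group in place and return its name."""
    victim = min(groups, key=lambda name: taylor_saliency(groups[name]))
    del groups[victim]
    return victim


def prune_to_budget(groups: Dict[str, Group], budget: int) -> List[str]:
    """Prune one group at a time until only `budget` groups remain.

    In an actual training loop, a weight update and a fresh gradient
    computation would sit between successive calls to prune_step.
    """
    removed = []
    while len(groups) > budget:
        removed.append(prune_step(groups))
    return removed


if __name__ == "__main__":
    toy_groups = {
        "neuron_0": {"w": [0.5, -0.2], "g": [0.1, 0.3]},
        "neuron_1": {"w": [1.0, 0.4], "g": [0.2, -0.1]},
        "neuron_2": {"w": [-0.3, 0.8], "g": [0.5, 0.5]},
    }
    print(prune_to_budget(toy_groups, budget=1), list(toy_groups))
```

Pruning one group per step, rather than removing a large fraction at once, mirrors the continuous model size compression described in the abstract; the surviving set of groups is what defines the resulting architecture.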

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: An overview of pruning-aware pretraining. (a) The training loop includes joint saliency detection and weight optimization, pruning-type selection from the pruning space, and second-order weight updating. (b) Traditional post-training pruning can be embedded in the training loop to scale it up. (c) Continuous model size compression during pretraining.
  • Figure 2: Performance of pruning-aware pretraining. By scaling up LLM-Pruner in pretraining, the performance of the source model is retained even when the pruning rate exceeds 70%.
  • Figure 3: Three basic pruning types in the pruning space. We plot all the weight matrices with shape $[D_{input}, D_{output}]$. In backpropagation (in orange), the saliency of the output layer group (in blue) is calculated according to Eq. \ref{e9}.
  • Figure 4: Win rate of EfficientLLM in the instruction tuning task.
  • Figure 5: Scalability of pruning-aware pretraining.
  • ...and 3 more figures