DarwinLM: Evolutionary Structured Pruning of Large Language Models

Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh

TL;DR

DarwinLM addresses the cost of deploying large language models under practical compute constraints with a training-aware structured pruning framework. It combines second-order structured pruning with an evolutionary search over non-uniform sparsity allocations, and evaluates offspring with a lightweight, multi-step finetuning process so that selection reflects how well each candidate recovers after training. The approach achieves state-of-the-art structured pruning results on Llama-2-7B, Llama-3.1-8B, and Qwen-2.5-14B-Instruct; for instance, it surpasses ShearedLlama while using 5x less post-compression training data. Because the pruning is structured, the resulting speedups are end-to-end and do not depend on the hardware environment.

Abstract

Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training. Code is at: https://github.com/IST-DASLab/DarwinLM
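The search loop described in the abstract (mutate, train briefly, select, repeat with more tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `toy_fitness`, `mutate`, `select`, `evolve`, the `SENSITIVITY` values, the stage schedule, and the choice to keep the parent in the selection pool are all assumptions made for illustration. The toy fitness stands in for "finetune the candidate on a given number of tokens, then measure its loss"; no model is trained here.

```python
import random

# Made-up per-layer pruning sensitivities (illustrative only, not from the paper).
SENSITIVITY = [5.0, 1.0, 0.5, 3.0, 0.2, 2.0]
MAX_LEVEL = 8  # hypothetical number of discrete sparsity levels per layer


def toy_fitness(levels, tokens):
    """Stand-in for 'finetune on `tokens` tokens, then measure loss'. Lower is better."""
    damage = sum(s * l * l for s, l in zip(SENSITIVITY, levels))
    # In this toy model, a larger token budget recovers more of the damage.
    return damage * (1.0 + 1000.0 / tokens)


def mutate(levels, rng):
    """Raise one layer's sparsity level and lower another's, keeping the total fixed."""
    child = list(levels)
    for _ in range(100):  # bounded retries in case a sampled pair is at its limits
        i, j = rng.sample(range(len(child)), 2)
        if child[i] < MAX_LEVEL and child[j] > 0:
            child[i] += 1
            child[j] -= 1
            break
    return child


def select(pool, stages):
    """Multi-stage selection: grow the token budget while shrinking the pool."""
    for tokens, keep in stages:
        pool = sorted(pool, key=lambda c: toy_fitness(c, tokens))[:keep]
    return pool[0]


def evolve(parent, generations=50, offspring=8, seed=0):
    rng = random.Random(seed)
    stages = [(1_000, 4), (5_000, 2), (20_000, 1)]  # (tokens, survivors), assumed schedule
    for _ in range(generations):
        children = [mutate(parent, rng) for _ in range(offspring)]
        # Assumption of this sketch: the parent competes with its offspring.
        parent = select([parent] + children, stages)
    return parent


uniform = [4] * len(SENSITIVITY)  # start from a uniform sparsity allocation
best = evolve(uniform)
print(best, toy_fitness(best, 20_000))
```

Because each mutation moves one level from one layer to another, every candidate has the same total sparsity as the starting allocation, so the search only redistributes sparsity across layers. In this toy setting the search shifts sparsity toward layers with low sensitivity.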

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Visual illustration of the DarwinLM pipeline. 1) Generate a sparsity level database by applying second-order structured pruning at different sparsities. 2) Run an evolutionary search with training-aware selection over the sparsity level database.
  • Figure 2: Motivation for the training-aware selection. The y-axis shows the KL-Divergence of the model after full post-training, while the x-axis shows the KL-Divergence after small-scale data training. The results indicate that our training-aware selection can select the best offspring for large-scale training.
  • Figure 3: Performance comparison of DarwinLM and ShearedLlama with different numbers of training tokens. DarwinLM achieves better performance than ShearedLlama at every token budget.
  • Figure 4: Comparison of DarwinLM and other one-shot methods that remove modules entirely. Our method consistently outperforms them at all sparsity levels, demonstrating the effectiveness of our finer-grained structured pruning approach. Note that the y-axis is log-scaled.
  • Figure 5: Post-training comparison of ShearedLlama and DarwinLM on each benchmark.
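Figure 2 uses KL-Divergence as the quantity being compared. As a hedged illustration of how such a score can be computed from model outputs, the sketch below measures the KL-Divergence between the next-token distributions of a reference model and a candidate model, given their logits. The function names and the choice of the reference (for example, the dense model) as the first argument are assumptions of this sketch; the paper's exact evaluation setup is not reproduced here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, cand_logits):
    """KL(ref || cand) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cand_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_batch, cand_batch):
    """Average per-position KL over a batch of logit vectors."""
    total = sum(kl_divergence(r, c) for r, c in zip(ref_batch, cand_batch))
    return total / len(ref_batch)
```

A score of zero means the candidate reproduces the reference distribution exactly at every position; larger values mean the candidate's predictions have drifted further from the reference.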