Compressing Large Language Models with Automated Sub-Network Search

Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein

TL;DR

The paper tackles the scalability challenge of large language models by proposing an automated sub-network search via two-stage neural architecture search to identify Pareto-optimal sparse architectures that balance accuracy and on-device latency. It introduces a joint search space for decoder-only Transformers, a calibrated sampling strategy, importance-based sorting, and integration with parameter-efficient fine-tuning (LoRA) and in-place knowledge distillation to efficiently explore many architectures within a single training run. The method yields Pareto-optimal sub-networks that consistently outperform structural pruning baselines and smaller models across 11 downstream tasks, delivering significant latency reductions (up to about 22%) while preserving or improving accuracy. This approach enables more practical deployment of large models on resource-constrained devices and provides a scalable, automated pipeline for LLM compression that reduces cost and energy use. The work advances automated model optimization by combining NAS, importance-driven sub-network selection, and PEFT to generate adaptable, hardware-aware AI systems.

Abstract

Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. In this paper, we consider model compression for LLMs to reduce model size while improving downstream task performance. We phrase this as a neural architecture search problem that automatically prunes structural components, such as attention heads, neurons, and layers, by searching for the Pareto-optimal set of sub-networks that balance performance and on-device latency. Compared to state-of-the-art structural pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model, our method achieves up to 9.85% improvement on average across 11 diverse downstream tasks, while achieving up to 22% improvement in on-device latency.
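
To make the notion of a Pareto-optimal set concrete, the following is a minimal sketch of selecting non-dominated sub-networks from a pool of candidates scored by accuracy and latency. The `Candidate` type, the candidate names, and all numbers are hypothetical placeholders for illustration, not results or code from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical sub-network with its two objectives."""
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives
    and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    strictly_better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates,
    ordered by increasing latency."""
    front = [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
    return sorted(front, key=lambda c: c.latency_ms)


# Made-up (accuracy, latency) values purely for illustration.
candidates = [
    Candidate("A", 0.70, 100.0),
    Candidate("B", 0.68, 80.0),
    Candidate("C", 0.65, 90.0),   # dominated by B
    Candidate("D", 0.60, 60.0),
    Candidate("E", 0.70, 120.0),  # dominated by A
]

front = pareto_front(candidates)
print([c.name for c in front])  # prints ['D', 'B', 'A']
```

Each point on the resulting front is a different trade-off: no remaining candidate can be made faster without losing accuracy, or more accurate without becoming slower.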

Paper Structure

This paper contains 31 sections, 12 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Comparison of Average Accuracy (on commonsense reasoning tasks) vs. Latency and Parameter Pareto-Fronts for different pruning methods on Llama-3.1-8B
  • Figure 2: Block importance scheme
  • Figure 3: Parameter count distribution for sub-networks derived from the Joint-Space. (a) sampled using random-sampling scheme. (b) sampled according to grid-sampling scheme. We can see that sampling randomly tends to over-sample tiny models that are not capable of achieving reasonable performance.
  • Figure 4: Illustration of importance sorting for a simple 1-layer FFN with 3 units. Reshuffling the hidden units does not change the final output. After sorting, we extract a sub-network with 2 hidden units.
  • Figure 5: Comparison of Accuracy vs. Latency Pareto-Fronts for Different Architecture Sampling Schemes
  • ...and 3 more figures
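
The idea in the Figure 4 caption can be sketched in a few lines: reordering the hidden units of a 1-layer FFN (together with their incoming and outgoing weights) leaves the output unchanged, so units can be sorted by importance and a smaller sub-network taken as a prefix. In this sketch the weights are made up, and the importance score (absolute outgoing weight) is an assumption chosen for simplicity, not the score used in the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def ffn(x: List[float], w_in: Matrix, b_in: List[float],
        w_out: Matrix, b_out: List[float]) -> List[float]:
    """1-layer FFN: y = W_out * relu(W_in * x + b_in) + b_out."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_in, b_in)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]


def select_units(w_in: Matrix, b_in: List[float], w_out: Matrix,
                 order: List[int]) -> Tuple[Matrix, List[float], Matrix]:
    """Reorder (or subset) hidden units: rows of W_in, entries of b_in,
    and the matching columns of W_out move together."""
    return (
        [w_in[i] for i in order],
        [b_in[i] for i in order],
        [[row[i] for i in order] for row in w_out],
    )


# Made-up weights: 2 inputs, 3 hidden units, 1 output.
w_in = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
b_in = [0.0, 1.0, -1.0]
w_out = [[0.25, 2.0, -1.0]]
b_out = [0.5]
x = [2.0, 1.0]

# Hypothetical importance score: absolute outgoing weight of each unit.
importance = [abs(w_out[0][j]) for j in range(len(b_in))]
order = sorted(range(len(b_in)), key=lambda j: importance[j], reverse=True)

y_full = ffn(x, w_in, b_in, w_out, b_out)

# Sorting all 3 units is just a permutation, so the output is unchanged.
s_in, s_b, s_out = select_units(w_in, b_in, w_out, order)
y_sorted = ffn(x, s_in, s_b, s_out, b_out)

# The sub-network keeps only the 2 most important hidden units.
k_in, k_b, k_out = select_units(w_in, b_in, w_out, order[:2])
y_sub = ffn(x, k_in, k_b, k_out, b_out)

print(y_full, y_sorted, y_sub)
```

The sorted network reproduces the full output, while the 2-unit sub-network generally gives a different output because the least important unit has been dropped.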