How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve

Waleed Reda, Abhinav Jangda, Krishna Chintalapudi

TL;DR

This work tackles the problem of task-aware parameter sufficiency for deploying LLMs in narrow domains. It introduces LLM-Sieve, which uses output-aligned non-orthogonal projections and genetic-algorithm-driven adaptive pruning to identify a task-specific subnetwork that preserves end-to-end performance within a tolerance $\epsilon$. The approach achieves 25–75% parameter reduction (and up to ~90% memory savings with quantization) across models from 3.8B to 70B parameters, significantly outperforming prior pruning methods and revealing bottleneck matrices that concentrate critical knowledge. LLM-Sieve remains compatible with LoRA fine-tuning and 8-bit quantization, enabling efficient deployment and providing insights into knowledge organization that could inform future architectural design.
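
To make the idea of tolerance-constrained adaptive pruning concrete, the following is a minimal toy sketch of a genetic algorithm that searches for per-matrix keep fractions. It is not the authors' implementation. The matrix sizes, the hidden per-matrix requirements in `NEEDED`, the baseline accuracy, the value of `EPSILON`, and the `toy_accuracy` function are all hypothetical stand-ins; in a real setting the accuracy would come from evaluating the pruned model on the task.

```python
import random

# All values below are hypothetical and chosen only for illustration.
SIZES = [4.0, 4.0, 16.0, 16.0]           # matrix sizes, in millions of parameters
NEEDED = [0.2, 0.8, 0.3, 0.5]            # hidden minimum keep fraction per matrix
LEVELS = [i / 10 for i in range(1, 11)]  # candidate keep fractions 0.1 .. 1.0
BASELINE = 0.90                          # accuracy of the unpruned model
EPSILON = 0.02                           # allowed accuracy drop (the tolerance)


def toy_accuracy(keep):
    """Stand-in for running the pruned model on the task's evaluation set."""
    shortfall = sum(max(0.0, need - k) for k, need in zip(keep, NEEDED))
    return max(0.0, BASELINE - shortfall)


def kept_params(keep):
    """Parameters (in millions) that remain after pruning."""
    return sum(k * s for k, s in zip(keep, SIZES))


def fitness(keep):
    """Lower is better: remaining parameters plus a large penalty
    for violating the accuracy tolerance."""
    violation = max(0.0, (BASELINE - EPSILON) - toy_accuracy(keep))
    return kept_params(keep) + 1000.0 * violation


def evolve(generations=60, pop_size=30, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Include the unpruned model so a feasible candidate always exists.
    population = [[1.0] * len(SIZES)]
    population += [[rng.choice(LEVELS) for _ in SIZES] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness)
        elite = population[: pop_size // 3]  # elitism: best candidates survive
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [rng.choice(LEVELS) if rng.random() < mutation_rate else g
                     for g in child]                          # per-gene mutation
            children.append(child)
        population = elite + children
    return min(population, key=fitness)


best = evolve()
print("keep fractions:", best)
print("kept params (M):", kept_params(best), "of", sum(SIZES))
print("accuracy:", toy_accuracy(best))
```

In this toy setup the search can prune the second matrix far less than the others, which loosely mirrors the notion of a bottleneck matrix: the keep fractions end up uneven because the task's sensitivity is uneven.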

Abstract

As Large Language Models (LLMs) are increasingly deployed for narrow tasks in resource-constrained settings, a central question arises: how much of an LLM is truly necessary for a given task? We present LLM-Sieve, a framework that prunes LLMs down to the minimal parameter subset needed to preserve task performance. Our approach introduces two innovations: (i) output-aligned non-orthogonal projections, which yield more faithful low-rank approximations than traditional PCA/SVD by aligning directly with layer outputs; and (ii) adaptive pruning via a Genetic Algorithm, which automatically discovers matrix-specific pruning levels and exposes the uneven distribution of task-relevant knowledge. Across models from 3.8B to 70B parameters, LLM-Sieve removes 20–75% of weights with only 1–5% accuracy loss, substantially ahead of prior pruning methods. Beyond efficiency, our framework reveals bottleneck matrices that concentrate critical knowledge, suggesting architectural implications for future LLM design. LLM-Sieve integrates seamlessly with LoRA fine-tuning and quantization, enabling both efficient deployment and deeper understanding of knowledge organization in LLMs.
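
The difference between approximating a weight matrix and approximating its outputs can be illustrated with a small NumPy sketch. This is not the paper's non-orthogonal projection; it swaps in a simpler, standard technique (an orthogonal projection onto the top principal directions of the calibration outputs $Y = WX$) purely to show why aligning with layer outputs can beat a truncated SVD of $W$ when task inputs occupy a small subspace. The dimensions, the synthetic calibration data, and all function names are assumptions made for the example.

```python
import numpy as np


def truncated_svd(W, rank):
    """Weight-only baseline: best rank-`rank` approximation of W itself."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def output_aligned_factors(W, X, rank):
    """Rank-`rank` factors chosen from the calibration outputs Y = W @ X.

    Returns (P, Q) with W_hat = P @ Q, where P holds the top left singular
    vectors of Y. This minimizes ||W X - W_hat X||_F over rank-`rank` W_hat.
    """
    Y = W @ X
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    P = U[:, :rank]   # shape (d_out, rank)
    Q = P.T @ W       # shape (rank, d_in)
    return P, Q


def relative_output_error(W, W_hat, X):
    """Relative Frobenius error of the layer outputs on the calibration set."""
    Y = W @ X
    return np.linalg.norm(Y - W_hat @ X) / np.linalg.norm(Y)


rng = np.random.default_rng(0)
d_out, d_in, n_samples, rank = 32, 48, 256, 8

W = rng.standard_normal((d_out, d_in))

# Synthetic "task" inputs: mostly confined to a 6-dimensional subspace,
# plus a small amount of noise. This is an assumption for illustration.
basis = rng.standard_normal((d_in, 6))
X = basis @ rng.standard_normal((6, n_samples))
X += 0.01 * rng.standard_normal((d_in, n_samples))

W_svd = truncated_svd(W, rank)
P, Q = output_aligned_factors(W, X, rank)
W_aligned = P @ Q

err_svd = relative_output_error(W, W_svd, X)
err_aligned = relative_output_error(W, W_aligned, X)

print("relative output error, truncated SVD of W:", err_svd)
print("relative output error, output-aligned    :", err_aligned)
print("parameters: dense", W.size, "vs factored", P.size + Q.size)
```

Storing the two factors instead of the dense matrix costs `rank * (d_out + d_in)` parameters rather than `d_out * d_in`, which is where the parameter reduction comes from once the rank is small enough.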

Paper Structure

This paper contains 34 sections, 9 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: End-to-end accuracy vs. percentage of parameters pruned for a Generic-RAG multi-step QA task on Phi-3-mini (3.8B), LLaMA-3.1-8B, and LLaMA-3.1-70B, with and without 8-bit quantization ("-Q"). Accuracy remains stable until 25–60% pruning, then drops sharply, revealing redundancy followed by task-critical bottlenecks.
  • Figure 2: Each matrix multiplication in an LLM is approximated in LLM-Sieve.
  • Figure 3: Low-rank approximations used in LLM-Sieve pruning.
  • Figure 4: Intuition behind LLM-Sieve projections.
  • Figure 5: Calibration step in LLM-Sieve.
  • ...and 13 more figures