Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

Ivan Ilin, Peter Richtarik

TL;DR

Thanos tackles the challenge of compressing large language models by introducing a block-wise pruning algorithm with adaptive masks that support unstructured, structured, and semi-structured sparsity. It builds on a data-aware pruning paradigm, deriving a tractable approach that greedily prunes weights block-by-block while updating remaining weights to minimize the input-output discrepancy. A key novelty is the combination of a Wanda-style pruning metric with flexible block handling, outlier-row treatment for structured pruning, and support for $n:m$ semi-structured sparsity, enabling hardware-friendly acceleration on modern GPUs. Empirical results across OPT and LLaMA family models show that Thanos outperforms baseline pruning methods in structured and semi-structured settings and maintains competitive perplexity and zero-shot performance under high sparsity, all while offering an open-source implementation. These findings suggest Thanos as a practical tool for deploying large models in resource-constrained environments, with strong performance especially in structured pruning scenarios.
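To make the $n:m$ pattern and the Wanda-style metric concrete, here is a minimal NumPy sketch. It assumes the usual Wanda score $|W_{ij}| \cdot \|X_j\|_2$ and the convention that $n$ of every $m$ consecutive weights in a row are zeroed. The function names, shapes, and random data are hypothetical, and the weight update that Thanos applies to the remaining weights is deliberately omitted, so this illustrates mask selection only and is not the authors' implementation.

```python
import numpy as np

def wanda_metric(W, X):
    """Wanda-style score |W_ij| * ||X_j||_2.

    Assumed shapes: W is (out_features, in_features),
    X is (in_features, n_samples) calibration activations.
    """
    input_norms = np.linalg.norm(X, axis=1)  # one norm per input feature
    return np.abs(W) * input_norms[None, :]

def nm_prune_mask(scores, n, m):
    """Boolean mask (True = prune) marking the n lowest-scoring
    weights in every group of m consecutive weights of each row."""
    rows, cols = scores.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = scores.reshape(rows, cols // m, m)
    lowest = np.argsort(groups, axis=2)[:, :, :n]  # n smallest per group
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, lowest, True, axis=2)
    return mask.reshape(rows, cols)

# Hypothetical toy layer: 4 outputs, 8 inputs, 16 calibration samples.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(8, 16))

mask = nm_prune_mask(wanda_metric(W, X), n=2, m=4)
W_pruned = np.where(mask, 0.0, W)  # 2:4 sparse copy of W
```

Every group of four consecutive weights in `W_pruned` contains exactly two zeros, which is the regular layout that sparse GPU kernels rely on.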

Abstract

This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.

Paper Structure

This paper contains 68 sections, 1 theorem, 104 equations, 9 figures, 17 tables, 9 algorithms.

Key Result

Lemma 5.1

The optimal solution to the problem (eq:row_wise_problem_start_appendix) satisfies:

Figures (9)

  • Figure 1: Wikitext2 perplexity of models pruned by different methods: Wanda, SparseGPT, and Thanos (ours). (a) Unstructured pruning (OPT-125M), (b) Structured pruning (LLaMA-3 8B).
  • Figure 2: Demonstration of the Thanos mask selection algorithm. Entries of $M$ that are equal to one are represented by orange circles, while zeros are depicted with green circles.
  • Figure 3: Demonstration of the main steps of structured pruning by the Thanos algorithm.
  • Figure 4: Demonstration of the SparseGPT mask selection algorithm. Entries of $M$ that are equal to one are represented by orange circles, while zeros are depicted with green circles. In the first step (a), we prune the first block of parameters. To do this, we compute the mask for the local block of size $B$. The local block should be $p\%$ sparse, so the local block mask will be $p\%$ dense. In the second step (b), we compute the mask for the second local block of the same size with the same local sparsity $p\%$, and so on.
  • Figure 5: Demonstration of off-block parameter communication for SparseGPT. Entries of $W$ marked with orange circles indicate parameters designated for removal, while those marked with green circles represent weights that require updating. When a parameter is pruned, all parameters to its right in the sequence are updated.
  • ...and 4 more figures
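
The block-local mask selection described in the Figure 4 caption can be sketched roughly as follows. This is an illustrative sketch only: the `scores` matrix is a stand-in for whatever importance score a method uses, the function name and the toy data are hypothetical, and the weight updates shown in Figure 5 are not modelled. The mask convention follows the captions: an entry equal to one (here `True`) marks a weight to prune.

```python
import numpy as np

def blockwise_local_mask(scores, block_size, sparsity):
    """Select a pruning mask one column block at a time.

    Within each block of `block_size` columns, the fraction `sparsity`
    of entries with the lowest scores is marked True (= prune), so each
    local block reaches the same local sparsity.
    """
    rows, cols = scores.shape
    mask = np.zeros((rows, cols), dtype=bool)
    for start in range(0, cols, block_size):
        block = scores[:, start:start + block_size]
        k = int(block.size * sparsity)  # entries to prune in this block
        if k == 0:
            continue
        # Indices of the k smallest scores within the flattened block.
        flat_idx = np.argpartition(block.ravel(), k - 1)[:k]
        r, c = np.unravel_index(flat_idx, block.shape)
        mask[r, start + c] = True
    return mask

# Hypothetical toy scores: 4 rows, 8 columns, two blocks of size B = 4.
rng = np.random.default_rng(0)
scores = rng.random((4, 8))
local_mask = blockwise_local_mask(scores, block_size=4, sparsity=0.5)
```

Because the selection is made per block, each block of four columns ends up with exactly half of its entries pruned, regardless of how scores are distributed across the rest of the matrix.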

Theorems & Definitions (2)

  • Lemma 5.1
  • proof