Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression
Ivan Ilin, Peter Richtarik
TL;DR
Thanos compresses large language models with a block-wise pruning algorithm whose adaptive masks support unstructured, structured, and semi-structured sparsity. It follows the data-aware pruning paradigm: weights are pruned greedily, block by block, and the remaining weights are updated to minimize the resulting input-output discrepancy of the layer. The key novelty is the combination of a Wanda-style pruning metric with flexible block handling, outlier-row treatment for structured pruning, and support for $n:m$ semi-structured sparsity, which enables hardware-friendly acceleration on modern GPUs. Empirical results on OPT and LLaMA family models show that Thanos outperforms baseline pruning methods in structured and semi-structured settings and maintains competitive perplexity and zero-shot performance under high sparsity. An open-source implementation is available. These findings suggest that Thanos is a practical tool for deploying large models in resource-constrained environments, and that it is strongest in structured pruning scenarios.
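For readers unfamiliar with the data-aware paradigm, the "input-output discrepancy" can be written out as follows. The notation is an assumption made for this sketch and is not taken from the summary: $W$ is the weight matrix of one linear layer, $X$ is a matrix of calibration inputs whose rows correspond to input features, and $\widehat{W}$ is the pruned matrix.

$$
\min_{\widehat{W}} \; \bigl\| (\widehat{W} - W)\, X \bigr\|_F^2 \quad \text{subject to } \widehat{W} \text{ satisfying the target sparsity pattern.}
$$

A Wanda-style metric, in its commonly stated form, scores each weight by its magnitude scaled by the norm of the corresponding input feature, and low-scoring weights are the pruning candidates:

$$
S_{ij} = |W_{ij}| \cdot \| X_{j:} \|_2 .
$$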
Abstract
This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
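To make the $n:m$ pattern concrete, the following is a minimal NumPy sketch of selecting an $n:m$ mask from a Wanda-style score. It is not the Thanos algorithm: it only chooses which weights to zero and omits the block-wise update of the remaining weights. The function names `wanda_metric` and `nm_mask` are hypothetical, and the sketch assumes the convention that $n$ of every $m$ consecutive weights in a row are set to zero.

```python
import numpy as np


def wanda_metric(W, X):
    """Wanda-style score: |W_ij| times the L2 norm of input feature j.

    W: (out_features, in_features) weight matrix.
    X: (in_features, n_samples) calibration inputs (assumed layout).
    """
    feature_norms = np.linalg.norm(X, axis=1)  # one norm per input feature
    return np.abs(W) * feature_norms[None, :]


def nm_mask(scores, n, m):
    """Boolean mask (True = prune) marking the n lowest-scoring weights
    in every group of m consecutive weights along each row."""
    rows, cols = scores.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    if not 0 <= n <= m:
        raise ValueError("n must satisfy 0 <= n <= m")
    groups = scores.reshape(rows, cols // m, m)
    order = np.argsort(groups, axis=-1)  # ascending: lowest scores first
    prune = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(prune, order[..., :n], True, axis=-1)
    return prune.reshape(rows, cols)


# Example: 2:4 sparsity on a small random layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
X = rng.standard_normal((8, 16))
mask = nm_mask(wanda_metric(W, X), n=2, m=4)
W_pruned = np.where(mask, 0.0, W)
```

Each group of four consecutive weights in `W_pruned` then contains exactly two zeros, which is the layout that sparse tensor hardware on recent GPUs can exploit.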
