LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

David Hoffmann; Kailash Budhathoki; Matthaeus Kleindessner

TL;DR

The paper tackles the rising cost of inference in large language models by introducing graph-based pruning methods that use centrality measures to identify redundant neurons. It formulates MLP-Rank, which represents MLPs as weighted directed acyclic graphs and applies a modified Weighted PageRank (WPR) to derive per-neuron importance scores; combined with uniform pruning, this yields structured sparsity. The idea is extended to decoder-only transformers as LLM-Rank by chaining FFNs and applying component-wise WPR. Empirically, MLP-Rank achieves on average 6.09% higher accuracy retention than three popular baselines across several MLPs, while LLM-Rank attains 13.42% higher accuracy retention than two popular baselines on six Open-LLaMa-3b-v2 benchmarks, particularly at lower sparsities. Because the resulting sparsity is structured, the approach promises real-world speedups without specialized hardware. The paper points to future work on extending pruning to attention components and validating across diverse model families. Overall, the work indicates that graph-theoretic centrality can be an effective, hardware-friendly tool for pruning large neural networks.

Abstract

The evolving capabilities of large language models are accompanied by growing sizes and deployment costs, necessitating effective inference optimisation techniques. We propose a novel pruning method utilising centrality measures from graph theory, reducing both the computational requirements and the memory footprint of these models. Specifically, we devise a method for creating a weighted directed acyclic graph representation of multilayer perceptrons, to which we apply a modified version of the weighted PageRank centrality measure to compute node importance scores. In combination with uniform pruning, this leads to structured sparsity. We call this pruning method MLPRank. Furthermore, we introduce an extension to decoder-only transformer models and call it LLMRank. Both variants demonstrate strong performance: MLPRank leads on average to 6.09% higher accuracy retention than three popular baselines, and LLMRank to 13.42% higher accuracy retention than two popular baselines. Code is available at https://github.com/amazon-science/llm-rank-pruning.
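
To make the pipeline described in the abstract concrete, the following is a minimal sketch of the general idea: treat an MLP as a layered weighted graph, propagate PageRank-style importance from layer to layer, and then prune the same fraction of neurons in every hidden layer. It is not the authors' modified weighted PageRank. The function names (`layer_scores`, `uniform_prune_masks`, `apply_masks`), the use of absolute weights as edge weights, the uniform initial scores, and the damping value of 0.85 are all assumptions made for illustration; the linked repository holds the actual method.

```python
import numpy as np


def layer_scores(weights, damping=0.85):
    """PageRank-style importance per neuron for a layered (acyclic) MLP graph.

    weights[l] has shape (n_out, n_in), so entry [j, i] is the edge i -> j.
    Edge weights are |W| here, which is an assumption of this sketch.
    """
    n_in = weights[0].shape[1]
    scores = [np.full(n_in, 1.0 / n_in)]  # assumed uniform input importance
    for W in weights:
        A = np.abs(W).astype(float)
        col_sums = A.sum(axis=0, keepdims=True)
        # Normalise each source node's outgoing weights so they sum to 1;
        # nodes with no outgoing weight contribute nothing.
        P = np.divide(A, col_sums, out=np.zeros_like(A), where=col_sums > 0)
        n_out = W.shape[0]
        scores.append(damping * (P @ scores[-1]) + (1.0 - damping) / n_out)
    return scores


def uniform_prune_masks(scores, sparsity):
    """Keep the top (1 - sparsity) fraction of neurons in every hidden layer."""
    masks = []
    for s in scores[1:-1]:  # input and output layers are never pruned
        n_keep = max(1, int(round(len(s) * (1.0 - sparsity))))
        mask = np.zeros(len(s), dtype=bool)
        mask[np.argsort(s)[-n_keep:]] = True
        masks.append(mask)
    return masks


def apply_masks(weights, masks):
    """Structured pruning: remove whole rows and columns, not single weights."""
    pruned = [W.copy() for W in weights]
    for l, mask in enumerate(masks):
        pruned[l] = pruned[l][mask, :]          # rows that produce the neuron
        pruned[l + 1] = pruned[l + 1][:, mask]  # columns that consume it
    return pruned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mlp = [rng.normal(size=(8, 4)), rng.normal(size=(6, 8)), rng.normal(size=(3, 6))]
    masks = uniform_prune_masks(layer_scores(mlp), sparsity=0.5)
    print([W.shape for W in apply_masks(mlp, masks)])
```

Because entire neurons are removed, the pruned weight matrices are simply smaller dense matrices, which is why structured sparsity of this kind can reduce compute and memory without sparse kernels or specialised hardware.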
