LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

David Hoffmann, Kailash Budhathoki, Matthaeus Kleindessner

TL;DR

The paper tackles the rising cost of inference in large language models by introducing graph-based pruning methods that use centrality measures to identify redundant neurons. It formulates MLP-Rank, which represents an MLP as a weighted directed acyclic graph and applies a modified weighted PageRank (WPR) to derive per-neuron importance scores; combined with uniform pruning, this yields structured sparsity. The idea is extended to decoder-only transformers as LLM-Rank by chaining the FFNs and applying component-wise WPR. Empirically, MLP-Rank achieves on average 6.09% higher accuracy retention than three popular baselines across several MLPs, while LLM-Rank attains 13.42% higher accuracy retention than two popular baselines on six benchmarks with Open-LLaMa-3b-v2, particularly at lower sparsities. The approach promises real-world speedups without specialized hardware and points to future work on extending pruning to attention components and validating across diverse model families. Overall, the work indicates that graph-theoretic centrality can be an effective, hardware-friendly tool for pruning large neural networks.

Abstract

The evolving capabilities of large language models are accompanied by growing sizes and deployment costs, necessitating effective inference optimisation techniques. We propose a novel pruning method utilising centrality measures from graph theory, reducing both the computational requirements and the memory footprint of these models. Specifically, we devise a method for creating a weighted directed acyclic graph representation of multilayer perceptrons, to which we apply a modified version of the weighted PageRank centrality measure to compute node importance scores. In combination with uniform pruning, this leads to structured sparsity. We call this pruning method MLPRank. Furthermore, we introduce an extension to decoder-only transformer models and call it LLMRank. For both variants we demonstrate strong performance: MLPRank leads on average to 6.09 % higher accuracy retention than three popular baselines, and LLMRank to 13.42 % higher accuracy retention than two popular baselines. Code is available at https://github.com/amazon-science/llm-rank-pruning.
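The pipeline described in the abstract (graph representation, centrality scores, uniform structured pruning) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not spell out how weighted PageRank is modified, so the single forward propagation pass, the damping value of 0.85, the normalisation by absolute outgoing weight, and the function names `propagate_scores` and `uniform_prune` are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # Matrix[j][i]: weight of the edge from input neuron i to output neuron j


def propagate_scores(weights: List[Matrix], damping: float = 0.85) -> List[List[float]]:
    """Return one importance score per neuron for every layer after the input.

    Assumption: a PageRank-style update applied once per layer, which is
    enough on an acyclic layered graph. The paper's modified WPR may differ.
    """
    n_in = len(weights[0][0])
    prev = [1.0 / n_in] * n_in  # uniform importance on the input nodes
    all_scores = []
    for w in weights:
        n_out = len(w)
        # Total absolute outgoing weight of each source neuron (column sums).
        col_sums = [sum(abs(w[j][i]) for j in range(n_out)) for i in range(len(prev))]
        scores = []
        for j in range(n_out):
            incoming = sum(
                prev[i] * abs(w[j][i]) / col_sums[i]
                for i in range(len(prev))
                if col_sums[i] > 0.0
            )
            scores.append((1.0 - damping) / n_out + damping * incoming)
        all_scores.append(scores)
        prev = scores
    return all_scores


def uniform_prune(
    weights: List[Matrix], scores: List[List[float]], sparsity: float
) -> Tuple[List[Matrix], List[List[int]]]:
    """Drop the same fraction of lowest-scoring neurons from every hidden layer."""
    pruned = [[row[:] for row in w] for w in weights]
    kept_per_layer = []
    for l in range(len(weights) - 1):  # the output layer is left untouched
        n = len(weights[l])
        n_keep = n - int(sparsity * n)
        ranked = sorted(range(n), key=lambda j: scores[l][j], reverse=True)
        keep = sorted(ranked[:n_keep])
        pruned[l] = [pruned[l][j] for j in keep]  # remove whole rows (neurons)
        pruned[l + 1] = [[row[j] for j in keep] for row in pruned[l + 1]]  # and matching columns
        kept_per_layer.append(keep)
    return pruned, kept_per_layer


# Toy MLP: 2 inputs -> 4 hidden neurons -> 1 output.
weights = [
    [[1.0, 1.0], [0.1, 0.1], [2.0, 2.0], [0.0, 0.0]],
    [[1.0, 1.0, 1.0, 1.0]],
]
scores = propagate_scores(weights)
pruned, kept = uniform_prune(weights, scores, sparsity=0.5)
```

Because whole neurons are removed, the pruned weight matrices are simply smaller dense matrices, which is why structured sparsity of this kind needs no specialised hardware to translate into a speedup.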

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Illustration of the proposed graph representation extraction from the LLM FFNs. Part (1) shows the full decoder-only transformer model, from which the position-wise feed-forward networks are extracted and chained in part (2). The graph representation of the resulting MLP network, to which WPR is applied, is shown in part (3). Note that the grey arrows represent the mapping from the original architecture components, to the chained FFN, to the graph representation.
  • Figure 2: Mean zero-shot accuracy for each method across all five sparsity ratios.
  • Figure 3: Example of a two-layer MLP graph representation [hoffmann2024mlprank].
  • Figure 4: Weight matrix of the graph representation [hoffmann2024mlprank].
  • Figure 5: Illustrative example of the creation of an LLM graph representation that includes the key, value, and query matrices. Note that the grey arrows represent the mapping from the original architecture components, to the chained FFN, to the graph representation.
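The chaining step described in the Figure 1 caption can be pictured with a small sketch. It assumes each decoder block exposes a plain two-matrix FFN under the hypothetical keys `up` and `down`; attention, residual connections, normalisation, and any gating are deliberately ignored, and how the paper treats them is not stated in this summary. The function name `chain_ffns` is likewise made up for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]  # Matrix[j][i]: weight from input i to output j


def chain_ffns(blocks: List[Dict[str, Matrix]]) -> List[Matrix]:
    """Concatenate each block's FFN matrices into one MLP-style weight list."""
    chained: List[Matrix] = []
    for block in blocks:
        for name in ("up", "down"):  # hypothetical key names
            w = block[name]
            # The input width of this layer must equal the output width of the previous one.
            if chained and len(w[0]) != len(chained[-1]):
                raise ValueError("layer input width does not match previous output width")
            chained.append(w)
    return chained


# Two toy blocks with model width 2 and FFN width 3.
block = {
    "up": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # 2 -> 3
    "down": [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],   # 3 -> 2
}
mlp = chain_ffns([block, block])
layer_widths = [len(mlp[0][0])] + [len(w) for w in mlp]
```

The resulting list alternates between the model width and the FFN width, so it can be treated as an ordinary deep MLP whose nodes are the neurons and whose edge weights are the matrix entries.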