Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation

Harry Dong, Beidi Chen, Yuejie Chi

TL;DR

The paper tackles the high computational cost of large language models by exploiting activation sparsity in feedforward (FF) blocks through a training-free, prompt-driven method named GRIFFIN. The method builds on flocking, the observation that relative activation patterns are shared across the tokens of a sequence: GRIFFIN selects a subset of FF neurons from the prompt and uses only those neurons for the rest of generation, reducing latency while preserving accuracy across multiple models with non-ReLU activations. The paper frames the approach as FF block compression and reports speed-ups of up to $1.29\times$ at 50% FF sparsity with little to no performance loss on generation and classification tasks. Because it prunes whole neurons and requires neither training nor calibration, the approach is hardware-friendly and applies to a range of LLMs.

Abstract

With the development of transformer-based large language models (LLMs), they have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free and calibration-free method that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method's simplicity, we show with 50% of the FF parameters, GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.29$\times$ and 1.25$\times$ speed-ups in Gemma 7B and Llama 2 13B, respectively, on an NVIDIA L40). Code is available at https://github.com/hdong920/GRIFFIN.
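
The prompt-driven selection described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that each prompt token's FF activation vector is normalized to unit length to obtain relative activations, that each neuron is scored by the sum of its squared relative activations over the prompt, and that the top-scoring half is kept. The names `select_experts` and `prune_rows` and the toy activation values are hypothetical.

```python
import math


def select_experts(prompt_acts, keep_fraction=0.5):
    """Pick FF neurons to keep, given per-token activations from the prompt.

    prompt_acts: list of rows, one per prompt token, each row holding the
    activation of every FF neuron for that token.
    The scoring rule below is an assumption made for illustration.
    """
    num_neurons = len(prompt_acts[0])
    k = max(1, int(num_neurons * keep_fraction))
    scores = [0.0] * num_neurons
    for row in prompt_acts:
        # Normalize per token so that only relative magnitudes matter.
        norm = math.sqrt(sum(a * a for a in row)) or 1.0
        for j, a in enumerate(row):
            scores[j] += (a / norm) ** 2
    ranked = sorted(range(num_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def prune_rows(weight_rows, experts):
    """Structured pruning: keep only the weight rows of the selected neurons."""
    return [weight_rows[j] for j in experts]


# Toy prompt: 3 tokens, 4 FF neurons. Neurons 1 and 3 dominate every token.
prompt_acts = [
    [0.10, 2.0, -0.2, 1.5],
    [0.05, 4.0, 0.1, -3.0],
    [0.20, 1.0, 0.0, 0.9],
]
experts = select_experts(prompt_acts, keep_fraction=0.5)

# One weight row per neuron (hypothetical values); pruning drops whole rows.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
w_in_pruned = prune_rows(w_in, experts)
print(experts, w_in_pruned)
```

In this sketch the selected indices are computed once from the prompt and then reused for every generated token, which is what keeps the pruning structured: the smaller weight matrices stay dense.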

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Relative FF activation magnitudes of the first 512 features and tokens across a sequence from PG-19 (Rae et al., 2019; Gao et al., 2020) in layer 10 of Llama 2 7B (left) and Gemma 7B (right). These heatmaps show flocking, where relative activation magnitudes are shared within a sequence, illustrated by the distinct dark vertical streaks. More examples appear in the paper's appendix.
  • Figure 2: Average Jaccard similarity between WikiText samples' top FF neuron activations in Llama 2 7B (left) and Gemma 7B (right). Higher values indicate greater similarity.
  • Figure 3: GRIFFIN overview. Relative activations from the prompt determine expert neurons to use for generation.
  • Figure 4: Relative performance of GRIFFIN for Llama 2 7B (left), Gemma 7B (center), and Mistral 7B (right) as we enforce varying degrees of sparsity per FF block. For all tasks, the original model's performance for each task is normalized to 1.
  • Figure 5: Prompt length vs. generation length for Llama 2 7B (left) and Gemma 7B (right) as measured by increase in perplexity (PPL) from the full model on concatenated WikiText at 50% FF sparsity.
  • ...and 7 more figures
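
The Jaccard similarity reported in Figure 2 compares the sets of top FF neurons from two samples. The sketch below shows the computation on toy scores. How the paper scores neurons and how many it keeps per sample are not stated here, so `top_k_neurons`, the value of `k`, and the score lists are illustrative assumptions.

```python
def top_k_neurons(scores, k):
    """Return the indices of the k highest-scoring neurons as a set."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical per-neuron scores for two different samples.
scores_a = [0.9, 0.1, 0.8, 0.3]
scores_b = [0.2, 0.7, 0.95, 0.1]

top_a = top_k_neurons(scores_a, k=2)  # {0, 2}
top_b = top_k_neurons(scores_b, k=2)  # {1, 2}
similarity = jaccard(top_a, top_b)    # one shared neuron out of three
print(similarity)
```

A value near 1 means two samples rely on nearly the same top neurons, and a value near 0 means their top sets barely overlap.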