FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing

Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum

TL;DR

This work tackles the high inference cost of large language models by introducing Fine-Grained Token Pruning (FTP), a token-routing framework that adaptively skips tokens within each transformer block to reduce computation without retraining. FTP combines three components: a GA-based sparsity scheduler that allocates block-wise sparsity, a static router that initializes pruning, and a dynamic router that takes four low-dimensional features as input and decides, per token, whether to compute or skip. The router is trained with three losses (guide, sparsity constraint, and distillation). The training workflow alternates between sparsity search and router optimization, followed by fine-tuning of the sparsity schedule, and can be repeated to further improve performance. Across multiple models (e.g., LLaMA2-7B/13B, Qwen1.5-7B) and benchmarks (ARC, HellaSwag, MMLU, WinoGrande), FTP achieves state-of-the-art pruning results, maintaining high accuracy even at substantial sparsity and showing favorable speedups and KV-cache compatibility for practical deployment.
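
To make the idea of a GA-based sparsity scheduler concrete, here is a minimal sketch of a genetic search over per-block sparsity under a fixed average-sparsity budget. Everything in it is an assumption for illustration: the function names, the mutation and crossover operators, the population settings, and above all the toy `fitness` function, which in practice would presumably be some measurement of model quality under the candidate schedule. It is not the authors' scheduler.

```python
import random
from typing import Callable, List

Schedule = List[float]  # one sparsity value in [0, 1] per transformer block


def mutate(schedule: Schedule, rng: random.Random, step: float = 0.1) -> Schedule:
    """Move up to `step` sparsity from one block to another (sum is preserved)."""
    child = list(schedule)
    i, j = rng.sample(range(len(child)), 2)
    delta = min(step, child[i], 1.0 - child[j])  # stay inside [0, 1]
    child[i] -= delta
    child[j] += delta
    return child


def crossover(a: Schedule, b: Schedule) -> Schedule:
    """Average two parents; the mean sparsity of the child matches theirs."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def search_schedule(
    num_blocks: int,
    target: float,
    fitness: Callable[[Schedule], float],
    generations: int = 50,
    pop_size: int = 16,
    seed: int = 0,
) -> Schedule:
    """Return the fittest schedule found whose mean sparsity equals `target`."""
    rng = random.Random(seed)
    uniform = [target] * num_blocks
    population = [uniform] + [mutate(uniform, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Elitism: keep the better half, refill with mutated offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(*rng.sample(parents, 2)), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


# Hypothetical proxy: pretend earlier blocks are more sensitive to pruning.
sensitivity = [4.0, 3.0, 2.0, 1.0]


def toy_fitness(schedule: Schedule) -> float:
    return -sum(s * w for s, w in zip(schedule, sensitivity))


best = search_schedule(num_blocks=4, target=0.25, fitness=toy_fitness)
print(best)
```

Because both operators preserve the total sparsity, every candidate satisfies the budget by construction, so the search only has to decide how to distribute it across blocks.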

Abstract

Recently, large language models (LLMs) have demonstrated superior performance across various tasks by following scaling laws, which call for significantly larger models. However, the heavy computational overhead during inference hinders their deployment in industrial applications. Many works apply traditional compression approaches to accelerate inference, but these typically incur additional training costs to restore performance, and the pruned models usually show noticeable performance drops relative to the original model at a given level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify less important tokens and skip them across model blocks, reducing computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for allocating pruning sparsity, together with a trainable router that takes our four proposed low-dimensional factors as input and is trained with three proposed losses. We conduct extensive experiments across different benchmarks on different LLMs to demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing existing pruning methods. For instance, our method outperforms BlockPruner and ShortGPT by approximately 10 points on both LLaMA2-7B and Qwen1.5-7B in accuracy retention at comparable token sparsity levels.

Paper Structure

This paper contains 20 sections, 7 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of LLM structure and router workflow. (Left) Dense Transformer where all tokens are processed in every block. (Middle) Token Router for Transformer, which dynamically selects tokens to compute or skip based on their importance and block-wise sparsity at each block. (Right) A detailed view of how the Token Router uses token importance features to make binary decisions (compute or skip) for each token within a block.
  • Figure 2: Token similarity across different transformer blocks.
  • Figure 3: Overview of our method. (a) Our Fine-Grained Token Pruning uses token position $p$, absolute attention scores $s_a$, relative attention score rank $r_a$ and sparsity requirement $s_r$ to guide gate prediction, skipping computation instead of discarding tokens. A GA-based scheduler optimizes sparsity per block, and the router is trained with three proposed losses. (b) We decouple sparsity scheduling and router training into three steps, simplifying the optimization.
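
The captions above describe a router that reads four per-token features ($p$, $s_a$, $r_a$, $s_r$) and makes a binary compute-or-skip decision, with skipped tokens bypassing the block rather than being discarded. The toy sketch below illustrates that data flow only. The linear scorer, its weights, the feature normalization, and the scalar "hidden states" are all hypothetical stand-ins, not the paper's router architecture.

```python
from typing import Callable, List, Sequence


def router_features(attn_scores: Sequence[float], block_sparsity: float) -> List[List[float]]:
    """Build [p, s_a, r_a, s_r] for each token (normalization is an assumption)."""
    n = len(attn_scores)
    denom = max(n - 1, 1)
    # Rank 0 goes to the token with the highest attention score.
    order = sorted(range(n), key=lambda i: -attn_scores[i])
    rank = [0.0] * n
    for r, i in enumerate(order):
        rank[i] = r / denom
    return [[i / denom, attn_scores[i], rank[i], block_sparsity] for i in range(n)]


def gate(features: List[List[float]], weights: Sequence[float], bias: float) -> List[bool]:
    """Hypothetical linear scorer thresholded at zero: True = compute, False = skip."""
    return [sum(w * f for w, f in zip(weights, feat)) + bias > 0 for feat in features]


def apply_block(hidden: List[float], keep: List[bool], block_fn: Callable[[float], float]) -> List[float]:
    """Run the block on kept tokens; skipped tokens pass through unchanged."""
    return [block_fn(h) if k else h for h, k in zip(hidden, keep)]


attn = [0.4, 0.1, 0.3, 0.2]          # made-up attention scores for 4 tokens
feats = router_features(attn, block_sparsity=0.5)
# Made-up weights that only look at the rank feature: keep the top half.
keep = gate(feats, weights=[0.0, 0.0, -1.0, 0.0], bias=0.5)
out = apply_block([1.0, 2.0, 3.0, 4.0], keep, block_fn=lambda h: h * 10)
print(keep)  # [True, False, True, False]
print(out)   # [10.0, 2.0, 30.0, 4.0]
```

Note that the output keeps all four positions: the two skipped tokens retain their input values, which is what lets later blocks still see them.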