FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing
Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum
TL;DR
This work tackles the high inference cost of large language models by introducing Fine-Grained Token Pruning (FTP), a token-routing framework that adaptively skips tokens within each transformer block to reduce computation without retraining. FTP relies on a GA-based sparsity scheduler to allocate block-wise sparsity, a static router to initialize pruning, and a dynamic router that ingests four low-dimensional features to decide, per token, whether to compute or skip, guided by three losses (guide, sparsity constraint, and distillation). The training workflow alternates between sparsity-search and router optimization, followed by fine-tuning of the sparsity schedule, and can be repeated to further enhance performance. Across multiple models (e.g., LLaMA2-7B/13B, Qwen1.5-7B) and benchmarks (ARC, HellaSwag, MMLU, WinoGrande), FTP achieves state-of-the-art pruning results, maintaining high accuracy even at substantial sparsity and showing favorable speedups and KV-cache compatibility for practical deployment.
Abstract
Recently, large language models (LLMs) have demonstrated superior performance across various tasks by following scaling laws, which has led to significantly larger models. However, the resulting computational overhead during inference hinders their deployment in industrial applications. Many works apply traditional compression approaches to accelerate inference, but these typically incur additional training costs to restore performance, and the pruned models often show noticeable performance drops relative to the original model at a given level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify less important tokens and skip them across model blocks, reducing computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for allocating pruning sparsity, and a trainable router that takes our four proposed low-dimensional factors as input and is trained with three proposed losses. We conduct extensive experiments on several LLMs across different benchmarks to demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results. For instance, it outperforms BlockPruner and ShortGPT by approximately 10 points in accuracy retention on both LLaMA2-7B and Qwen1.5-7B at comparable token sparsity levels.
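To make the token-routing idea concrete, the following is a minimal sketch of a per-token compute-or-skip decision inside one block. It is an illustration under stated assumptions, not the paper's implementation: the sigmoid-of-linear router form, the 0.5 threshold, the feature values, the weights, and the names `router_score`, `routed_block`, and `token_sparsity` are all hypothetical, and the four features are placeholders rather than the factors proposed in the paper.

```python
import math

def router_score(features, weights, bias):
    """Sigmoid of a linear combination of one token's features.

    Assumption: a linear router is used here only for illustration.
    """
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def routed_block(hidden, features, weights, bias, block_fn, threshold=0.5):
    """Apply block_fn only to tokens whose router score reaches the threshold.

    Skipped tokens pass through unchanged, so the block costs nothing for them.
    Returns the new hidden states and the per-token keep mask.
    """
    keep = [router_score(f, weights, bias) >= threshold for f in features]
    out = [block_fn(h) if k else h for h, k in zip(hidden, keep)]
    return out, keep

def token_sparsity(keep):
    """Fraction of tokens that skipped the block."""
    return 1.0 - sum(keep) / len(keep)

# Toy input: 4 tokens with scalar hidden states and four made-up features each.
hidden = [1.0, 2.0, 3.0, 4.0]
features = [
    [0.9, 0.8, 0.7, 0.6],
    [0.1, 0.0, 0.2, 0.1],
    [0.5, 0.9, 0.4, 0.8],
    [0.0, 0.1, 0.0, 0.2],
]
weights = [1.0, 1.0, 1.0, 1.0]  # placeholder values, not learned
bias = -1.0

# The stand-in "block" multiplies the hidden state by 10 so the effect is visible.
out, keep = routed_block(hidden, features, weights, bias, block_fn=lambda h: h * 10)
print(keep)                  # which tokens were computed
print(out)                   # skipped tokens keep their original value
print(token_sparsity(keep))  # realized token sparsity for this block
```

In this toy run, the second and fourth tokens score below the threshold and bypass the block, giving a token sparsity of 0.5 for that block. In the described method, the block-wise sparsity targets would come from the sparsity scheduler and the router parameters from training with the three losses, neither of which is modeled here.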
