COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Eugene Kwek, Wenpeng Yin
TL;DR
COMPACT addresses the limitations of existing LLM pruning methods by jointly pruning rare vocabulary entries and FFN intermediate channels, scoring the channels on common tokens while preserving a standard transformer architecture. It exploits the scale-dependent parameter distribution, in which embeddings dominate in small models and FFNs dominate in large ones, and uses the common-token distribution to guide FFN channel importance via a weighted act$^2$ criterion. The method is training-free, architecture-agnostic, and tunable for different budgets, achieving state-of-the-art downstream performance on 0.5B–70B models with substantial reductions in parameters, GPU memory, and latency. In practice, this makes on-device deployment easier, broadens accessibility, and makes LLM serving more efficient across varied hardware and application settings.
Abstract
Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout and therefore requires custom inference code, while depth pruning can cause abrupt accuracy drops. Moreover, although many pruning approaches are effective on LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding and LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning channel importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (it can trade off vocabulary pruning against FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B–70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
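To make the two pruning steps concrete, here is a minimal NumPy sketch rather than the authors' implementation. It assumes that token frequencies come from a calibration corpus, that each kept token is weighted by its renormalized frequency and each pruned token by zero, and that the FFN is a plain two-matrix block without gating. All function names, array shapes, and the exact weighting scheme are hypothetical and chosen only for illustration.

```python
import numpy as np


def keep_common_tokens(token_counts, keep):
    """Return the sorted ids of the `keep` most frequent tokens."""
    counts = np.asarray(token_counts, dtype=np.float64)
    order = np.argsort(-counts, kind="stable")
    return np.sort(order[:keep])


def common_token_weights(token_counts, kept_ids):
    """Assumed weighting: renormalized frequency for kept tokens, 0 for pruned ones."""
    counts = np.asarray(token_counts, dtype=np.float64)
    weights = np.zeros_like(counts)
    weights[kept_ids] = counts[kept_ids] / counts[kept_ids].sum()
    return weights


def channel_importance(acts, token_ids, token_weights):
    """Weighted act^2 score per FFN channel.

    acts:      (n_positions, d_ff) intermediate activations on calibration data
    token_ids: (n_positions,) id of the token at each position
    """
    w = token_weights[token_ids]                  # (n_positions,)
    return (w[:, None] * acts ** 2).sum(axis=0)   # (d_ff,)


def prune_ffn(w_up, w_down, importance, keep):
    """Keep the `keep` highest-scoring channels.

    w_up: (d_ff, d_model), w_down: (d_model, d_ff). The result is still a
    standard (smaller) FFN, so no custom inference code is needed.
    """
    idx = np.sort(np.argsort(-importance, kind="stable")[:keep])
    return w_up[idx, :], w_down[:, idx], idx


# Toy example: 4-token vocabulary, 3 FFN channels, 3 calibration positions.
counts = np.array([100, 1, 50, 2])
kept = keep_common_tokens(counts, keep=2)
weights = common_token_weights(counts, kept)

embedding = np.arange(16, dtype=np.float64).reshape(4, 4)  # (vocab, d_model)
pruned_embedding = embedding[kept]                         # rows of kept tokens only

token_ids = np.array([0, 1, 2])
acts = np.array([[1.0, 0.0, 0.0],
                 [0.0, 10.0, 0.0],   # large activation, but on a pruned rare token
                 [0.0, 0.0, 2.0]])
importance = channel_importance(acts, token_ids, weights)

w_up = np.arange(12, dtype=np.float64).reshape(3, 4)
w_down = np.arange(12, dtype=np.float64).reshape(4, 3)
new_up, new_down, kept_channels = prune_ffn(w_up, w_down, importance, keep=2)
```

In the toy example, channel 1 has the largest raw squared activation, but it fires only on a token that vocabulary pruning removes. Its weighted score is therefore zero and it is the channel that gets dropped, which illustrates the idea of aligning channel importance with the post-pruning token distribution.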
