FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping
Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella
TL;DR
This work addresses the substantial inference cost of autoregressive LLMs by proposing FFN-SkipLLM, a fine-grained, input-adaptive skipping strategy that targets FFN blocks rather than whole layers. By using a cosine-similarity metric to identify redundancy, FFN-SkipLLM skips 25–30% of FFN computations without meaningful loss on knowledge-intensive tasks, and it avoids the KV-cache pitfalls that affect prior layer-skipping methods. Across factoid QA, in-context summarization, and multi-turn conversations, the approach maintains near full-model performance at modest skip ratios while reducing hallucination and token collapse. The method offers a simple, hardware-friendly path to faster autoregressive decoding and can be combined with other efficiency techniques, broadening the practical deployment of LLMs.
Abstract
Autoregressive Large Language Models (e.g., LLaMA, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges for autoregressive token-by-token generation. To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success on metrics like ROUGE-L/BLEU, owing to the redundancy across LLM layers, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination of wrong facts, and a noticeable performance drop even at a trivial exit ratio of 10-15% of layers. We attribute these errors primarily to ineffective handling of the KV cache through state copying during early exit. In this work, we observe the saturation of computationally expensive feed-forward blocks of LLM layers and propose FFN-SkipLLM, a novel fine-grained skip strategy for autoregressive LLMs. More specifically, FFN-SkipLLM is an input-adaptive feed-forward skipping strategy that can skip 25-30% of the FFN blocks of LLMs with marginal change in performance on knowledge-intensive generation tasks, without any requirement to handle the KV cache. Our extensive experiments and ablations across benchmarks like MT-Bench, Factoid-QA, and variable-length text summarization illustrate how our simple and easy-to-use method can facilitate faster autoregressive decoding.
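To make the idea of input-adaptive FFN skipping concrete, the following is a minimal sketch, not the authors' implementation. It assumes that saturation is detected by comparing the hidden state before and after an FFN block with cosine similarity, and that once the similarity crosses a threshold, the FFN blocks of all remaining layers are skipped for that token. The threshold value, the "skip everything afterwards" rule, the toy ReLU feed-forward block, and every function name here (cosine_similarity, ffn, decode_step, make_layers) are illustrative assumptions.

```python
import math
import random


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + eps)


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def ffn(x, w_in, w_out):
    """Toy feed-forward block: linear, ReLU, linear (an assumed stand-in)."""
    hidden = [max(h, 0.0) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)


def decode_step(x, layers, threshold=0.99):
    """Pass one token's hidden state through a stack of toy layers.

    Returns the final hidden state and the number of FFN blocks skipped.
    The threshold and the skip rule are assumptions made for illustration.
    """
    skipped = 0
    saturated = False
    for w_in, w_out in layers:
        # In a real model, self-attention would still run here for every
        # layer, so the KV cache is written as usual; only the FFN is skipped.
        if saturated:
            skipped += 1
            continue
        y = [xi + fi for xi, fi in zip(x, ffn(x, w_in, w_out))]  # residual add
        if cosine_similarity(x, y) >= threshold:
            saturated = True  # the FFN barely changed the state
        x = y
    return x, skipped


def make_layers(n_layers, dim, hidden, scale, seed=0):
    """Build random toy FFN weights; purely for demonstration."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    return [(mat(dim, hidden), mat(hidden, dim)) for _ in range(n_layers)]
```

Because only the feed-forward computation is bypassed in this sketch, every layer's attention (and therefore its key/value entries) would remain intact, which is the property the abstract contrasts with early-exit methods that must copy states into the KV cache.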
