FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping

Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella

TL;DR

This work addresses the substantial inference cost of autoregressive LLMs by proposing FFN-SkipLLM, a fine-grained, input-adaptive skipping strategy that targets FFN blocks rather than whole layers. By leveraging a cosine-similarity metric to identify redundancy, FFN-SkipLLM safely skips 25–30% of FFN computations without meaningful loss on knowledge-intensive tasks, and crucially avoids KV-cache pitfalls that plague prior layer-skipping methods. Across factoid QA, in-context summarization, and multi-turn conversations, the approach maintains near full-model performance at modest skip ratios while reducing hallucinations and token collapse. The solution offers a simple, hardware-friendly path to faster autoregressive decoding and can be combined with other efficiency techniques, expanding practical deployment of large LLMs.
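The cosine-similarity signal mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `threshold` value, and the `cold_layers` count are hypothetical choices made for illustration, and how and when the similarity is measured during decoding is left to the paper. The sketch only shows the idea that an FFN block whose output is nearly parallel to its input is a candidate for skipping, except in protected "cold" layers at the start and end of the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for a zero vector; avoids division by zero
    return dot / (norm_a * norm_b)


def should_skip_ffn(layer_idx, num_layers, similarity,
                    threshold=0.95, cold_layers=2):
    """Hypothetical skip rule (threshold and cold_layers are assumptions).

    Never skip inside the first or last `cold_layers` layers; elsewhere,
    skip when the FFN barely changes the token representation.
    """
    in_cold_region = (layer_idx < cold_layers
                      or layer_idx >= num_layers - cold_layers)
    return (not in_cold_region) and similarity >= threshold


# Toy token representation before and after an FFN block.
hidden_before = [1.0, 2.0, 3.0]
hidden_after = [1.1, 2.0, 2.9]
sim = cosine_similarity(hidden_before, hidden_after)
print(should_skip_ffn(layer_idx=10, num_layers=32, similarity=sim))
```

Because the rule depends on a similarity computed from the current input, the decision is input-adaptive: different tokens can skip different FFN blocks, while attention (and therefore the KV cache) is still computed in every layer.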

Abstract

Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges for autoregressive token-by-token generation. To mitigate the computation overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success on metrics like ROUGE-L/BLEU due to the redundancy across LLM layers, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination of wrong facts, and a noticeable performance drop even at a trivial exit ratio of 10-15% of layers. We attribute these errors primarily to ineffective handling of the KV cache through state copying during early exit. In this work, we observe the saturation of computationally expensive feed-forward blocks of LLM layers and propose FFN-SkipLLM, a novel fine-grained skip strategy for autoregressive LLMs. More specifically, FFN-SkipLLM is an input-adaptive feed-forward skipping strategy that can skip 25-30% of the FFN blocks of LLMs with marginal change in performance on knowledge-intensive generation tasks, without any requirement to handle the KV cache. Our extensive experiments and ablations across benchmarks like MT-Bench, Factoid-QA, and variable-length text summarization illustrate how our simple and easy-to-use method can facilitate faster autoregressive decoding.

Paper Structure (18 sections, 6 figures, 3 tables, 1 algorithm)

This paper contains 18 sections, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Merits of Autoregressive Decoding with Layer Skipping: Comparison of the responses generated by two recent layer-skipping methods, namely SkipDecode (Del Corro et al., 2023) and ShortGPT (Men et al., 2024), for a knowledge-intensive QA example. It can be observed that the LLaMa-chat-13B model with $\sim$25% of layers skipped per token suffers from hallucination and token collapse (repetitive generation) under both SkipDecode and ShortGPT, while FFN-SkipLLM can still retrieve the correct response.
  • Figure 2: Cosine similarity across the embedding dimension of a token tensor before and after the FFN block of different layers in the LLaMa-2 7B and 13B models. Inputs are sampled at random from the Wikitext and C4 datasets, and the mean curve indicates the average cosine similarity across 128 generated tokens. Red regions are termed cold regions in our work; skipping FFN blocks within these regions significantly hurts LLM performance.
  • Figure 3: Performance comparison of our baselines w.r.t. FFN-SkipLLM for in-context summarization of small (row 1), medium (row 2), and large (row 3) stories while preserving coherence, consistency, fluency, and relevance.
  • Figure 4: Examples of prompts used for different categories to evaluate the compressed LLM ASSISTANT w.r.t. the GPT-3.5 ASSISTANT using GPT-4 as a judge.
  • Figure 5: Performance comparison of our baselines with varying layer skip ratios w.r.t. FFN-SkipLLM on multi-turn conversation across 8 different categories.
  • ...and 1 more figures