
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu

TL;DR

AdaSkip addresses the high cost of long-context LLM inference with adaptive sublayer skipping that operates on both the attention and FFN sublayers of a transformer. It combines offline importance learning for the prefilling phase with online importance learning during decoding to identify and skip the least important sublayers. The amount of skipping is controlled by an acceleration ratio $\alpha$, and the selection is oriented toward preserving generation quality. Empirically, AdaSkip outperforms fixed-layer skipping baselines on prefilling, decoding, and end-to-end tasks across diverse long-context benchmarks and models, achieving speedups of up to $\sim$17% while maintaining Rouge-L and other quality metrics. The approach reduces time to first token (TTFT) and KV-cache costs in long-context scenarios and requires no additional training, which makes it practical for real-world long-context applications.

Abstract

Long-context large language model (LLM) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but have rarely been explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied to long-context inference, including the inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability to the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showcasing its superior inference performance over existing baselines.

Paper Structure (19 sections, 5 equations, 4 figures, 4 tables)

This paper contains 19 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The comparisons of different skipping strategies. The dashed box indicates the layer to be skipped.
  • Figure 2: IO similarities of different layers in various transformer models.
  • Figure 3: IO similarities of attention (ATTN) and FFN modules in different layers.
  • Figure 4: IO similarities of sublayer modules in prefilling (P) and decoding (D) phases.
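
The figure captions refer to input/output (IO) similarity per sublayer, and the abstract describes using similarity information to find less important sublayers. The following is a minimal sketch of that general idea, not the paper's implementation. It assumes that importance is approximated by the cosine similarity between a sublayer's input and output vectors, with higher similarity taken to mean the sublayer changes the hidden state less, and that a fixed number of the most similar sublayers is skipped. The function names, the sublayer labels, and the toy vectors are all hypothetical; the paper's actual measure and selection rule may differ.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def select_sublayers_to_skip(io_pairs, num_skip):
    """Hypothetical helper: pick the num_skip sublayers with the highest IO similarity.

    io_pairs maps a sublayer name to an (input_vector, output_vector) pair.
    Returns the chosen names (most similar first) and the full score table.
    """
    scores = {name: cosine_similarity(x, y) for name, (x, y) in io_pairs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_skip], scores


# Toy data (made up for illustration): attention and FFN are scored separately,
# so a single layer can keep one sublayer while skipping the other.
io_pairs = {
    "layer0.attn": ([1.0, 0.0], [0.0, 1.0]),  # output orthogonal to input
    "layer0.ffn": ([1.0, 0.0], [1.0, 0.1]),   # output nearly parallel to input
    "layer1.attn": ([1.0, 1.0], [2.0, 2.0]),  # output is a scaled copy of input
    "layer1.ffn": ([1.0, 0.0], [1.0, 1.0]),   # output at 45 degrees to input
}

skip, scores = select_sublayers_to_skip(io_pairs, num_skip=2)
print(skip)
```

Scoring attention and FFN modules independently is what distinguishes sublayer-wise skipping from whole-layer skipping in this sketch: here the two sublayers selected come from different layers.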