Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang

TL;DR

This paper tackles the high computational cost of inference in decoder-only LLMs by proposing a training-free, token-position aware depth-skipping method called $D^3$. It uses a power-law decay $\left\lfloor L \times (\alpha^i) \right\rfloor$ to determine how many transformer layers to keep for each generated token, guided by a core-flex layer separation and a focus on mitigating error propagation from missing states. Empirical results on LLaMA-family models (7B–70B) show average speedups of about $1.5\times$ with minimal accuracy loss ($<1\%$) on GSM8K and BBH, and the method remains compatible with batching and KV caching. The approach provides a practical, scalable addition to existing acceleration techniques, offering a new paradigm for efficient inference without retraining.

Abstract

Due to their large number of parameters, Large Language Models (LLMs) are resource-intensive at inference time. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. We propose a token-position-aware layer-skipping framework that cuts computation by roughly 1.5x while maintaining performance. We first observe that tokens predicted later have lower perplexity and thus require less computation. We then propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times \alpha^i \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ succeeds across a wide range of generation tasks for the first time. Experiments on large language models (i.e., the Llama family) with 7 to 70 billion parameters show that $D^3$ achieves an average 1.5x speedup over the full-inference pipeline with nearly no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.
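
The decay rule $\left\lfloor L \times \alpha^i \right\rfloor$ can be illustrated with a minimal sketch. The function name `layers_to_keep`, the `min_layers` floor, and the values $L = 32$ and $\alpha = 0.99$ are assumptions chosen for illustration; they are not taken from the paper, and the sketch covers only the layer-count schedule, not which layers are skipped or how missing states are handled.

```python
import math


def layers_to_keep(num_layers: int, alpha: float, step: int, min_layers: int = 1) -> int:
    """Return floor(L * alpha**i), the layer budget at decode step i.

    `min_layers` is a hypothetical safeguard (not from the paper) so the
    budget never drops to zero for very long generations.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    budget = math.floor(num_layers * alpha ** step)
    return max(min_layers, budget)


# Illustrative schedule for an assumed 32-layer model with alpha = 0.99.
for i in range(0, 50, 10):
    print(f"step {i:2d}: keep {layers_to_keep(32, 0.99, i)} layers")
```

With these assumed values the budget starts at the full 32 layers and shrinks gradually as the decode step grows, which matches the observation that later tokens need less computation.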

Paper Structure

This paper contains 30 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: $D^3$'s generation process vs. (a) standard implementation, (b) Early Exit, and (c) SkipDecode.
  • Figure 2: Error propagation and its explanation from perplexity (PPL) behavior when filling missing states.
  • Figure 3: Visualization of input/output information flow for each block during training, including hidden-state features and MLP and attention activation values.
  • Figure 4: The layer usage for the currently generated token $T_i$ follows a power-law decay function over decode time steps.
  • Figure 5: $D^3$ vs. Full Depth performance difference (±) on BBH benchmarks.
  • ...and 2 more figures