Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Siqi Fan; Xuezhi Fang; Xingrun Xing; Peng Han; Shuo Shang; Yequan Wang

TL;DR

This paper tackles the high computational cost of inference in decoder-only LLMs by proposing a training-free, token-position-aware depth-skipping method called $D^3$. It uses a power-law decay, $\left\lfloor L \times \alpha^i \right\rfloor$, to determine how many transformer layers to keep for each generated token, guided by a core-flex layer separation and a focus on mitigating error propagation from missing states. Empirical results on LLaMA-family models (7B–70B) show average speedups of about $1.5\times$ with minimal accuracy loss ($<1\%$) on GSM8K and BBH, and the method remains compatible with batching and KV caching. The approach is a practical, scalable complement to existing acceleration techniques and requires no retraining.

Abstract

Due to their large number of parameters, Large Language Models (LLMs) are resource-intensive at inference time. Unlike traditional model compression, which requires retraining, recent dynamic computation methods show that not all components are needed for inference, enabling a training-free pipeline. In this paper, we focus on dynamic depth in LLM generation. We propose a token-position-aware layer-skipping framework that reduces the number of operations by a factor of about 1.5 while maintaining performance. We first observe that tokens predicted later have lower perplexity and thus require less computation. We then propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which uses a power-law decay function, $\left\lfloor L \times \alpha^i \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ is the first such method to succeed across a wide range of generation tasks. Experiments on large language models (i.e., Llama) with $7 \sim 70$ billion parameters show that $D^3$ achieves an average 1.5x speedup over the full-inference pipeline with nearly no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.
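To make the decay rule concrete, the following is a minimal sketch of the schedule $\left\lfloor L \times \alpha^i \right\rfloor$ only, not the authors' implementation. It assumes that $L$ is the model's total layer count, that $i$ is a zero-based token position, and that $0 < \alpha \le 1$. The function names and the `min_layers` floor are hypothetical additions for illustration. The abstract does not say which layers are skipped or how missing states are handled, so the sketch does not model that.

```python
import math


def layers_to_keep(num_layers: int, alpha: float, position: int,
                   min_layers: int = 1) -> int:
    """Return floor(num_layers * alpha**position), clamped below by min_layers.

    The clamp is an assumption made here so the count never reaches zero;
    it is not stated in the abstract.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in (0, 1]")
    if position < 0:
        raise ValueError("position must be non-negative")
    return max(min_layers, math.floor(num_layers * alpha ** position))


def depth_schedule(num_layers: int, alpha: float, num_tokens: int,
                   min_layers: int = 1) -> list[int]:
    """Layer counts for token positions 0 .. num_tokens - 1."""
    return [layers_to_keep(num_layers, alpha, i, min_layers)
            for i in range(num_tokens)]


def average_depth_fraction(schedule: list[int], num_layers: int) -> float:
    """Fraction of full-depth layer evaluations the schedule actually uses."""
    return sum(schedule) / (len(schedule) * num_layers)


if __name__ == "__main__":
    # Illustrative values only: a 32-layer model and a deliberately steep decay.
    schedule = depth_schedule(num_layers=32, alpha=0.5, num_tokens=4)
    print(schedule)
    print(average_depth_fraction(schedule, num_layers=32))
```

With $\alpha = 1$ the schedule keeps every layer at every position, so it reduces to full inference. A usable $\alpha$ would be much closer to 1 than the value in this demo, which is chosen only to make the decay visible over a few tokens.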

