DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

Ning Yang, Fangxin Liu, Junjie Wang, Tao Yang, Kan Liu, Haibing Guan, Li Jiang

TL;DR

DASH tackles the latency and cost of large language model inference by introducing an input-aware dynamic layer-skipping framework. It frames skipping as a Markov Decision Process, uses a scoring model to decide per-layer actions, and couples this with a compensation mechanism to mitigate information loss. An asynchronous strategy hides decision latency by overlapping policy evaluation with computation. Empirical results show significant speedups across multiple backbones with minimal performance degradation, highlighting the practicality of dynamic, input-conditioned skipping for real-world deployment.

Abstract

Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose DASH, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple LLM architectures and NLP benchmarks show that our method achieves significant inference acceleration while maintaining competitive task performance, outperforming existing methods.
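
The control flow described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `make_layer` stands in for a Transformer layer, `toy_policy` replaces the learned scoring model with a cosine-similarity threshold, and the compensation step simply reuses a fraction (`comp`) of the last executed layer's update. The asynchronous overlap of policy evaluation and layer computation is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def make_layer(scale):
    """Hypothetical stand-in for a Transformer layer (residual update)."""
    def layer(h):
        return [x + scale * math.tanh(x) for x in h]
    return layer


def toy_policy(h_in, h_out, threshold):
    """Hypothetical decision rule: skip the next layer when the previous
    step barely changed the direction of the hidden state."""
    return "skip" if cosine(h_in, h_out) >= threshold else "run"


def forward(h, layers, threshold=0.99, comp=0.5):
    """Run a layer stack with per-layer run/skip decisions.

    The first layer always executes. For each later layer, the decision
    depends only on the current state (previous input and output), which
    is what makes the process Markovian in this toy setting.
    """
    h_in, h_out = h, layers[0](h)
    trace = ["run"]
    for layer in layers[1:]:
        action = toy_policy(h_in, h_out, threshold)
        if action == "run":
            h_in, h_out = h_out, layer(h_out)
        else:
            # Assumed compensation: approximate the skipped layer by a
            # scaled copy of the most recent update.
            delta = [b - a for a, b in zip(h_in, h_out)]
            h_in, h_out = h_out, [x + comp * d for x, d in zip(h_out, delta)]
        trace.append(action)
    return h_out, trace
```

Calling `forward([1.0, -2.0, 0.5], [make_layer(0.1) for _ in range(4)])` returns the final hidden vector and a trace of `"run"`/`"skip"` actions, one per layer. Raising `threshold` makes the toy policy more conservative.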

Paper Structure

This paper contains 15 sections, 17 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Cosine Similarity and Model Accuracy Analysis. The left panel illustrates the cosine similarity across layers, indicating that representation similarity stabilizes after the initial layers despite early fluctuations. The right panel shows a precipitous decline in model accuracy as the number of skipped layers increases, emphasizing that minimal skipping maintains reasonable performance while excessive skipping results in a substantial accuracy drop.
  • Figure 2: I/O similarities between different samples on the Qwen model with the MMLU dataset.
  • Figure 3: Overview of the DASH Framework. This method first processes the embedding layer and maintains full-precision computation in the first Transformer layer. Starting from the second layer, the scoring model evaluates the next layer's state using the modified input of the current layer, dynamically selecting the next layer's state. When a layer is skipped, a compensation mechanism is activated based on the scoring results, effectively balancing inference speed and model accuracy.
  • Figure 4: I/O similarity and layer-skipping states at different speedup ratios. The higher the I/O similarity, the more aggressive the layer-skipping strategy becomes, preferentially selecting layer states with higher acceleration ratios.
  • Figure 5: Results on MMLU datasets. Decision system trained on different datasets with Qwen-2.5-7B.
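
The relationship stated in the Figure 4 caption (higher I/O similarity leads to more aggressive layer states) can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the state names (`"skip"`, `"reduced"`, `"full"`) and the thresholds are hypothetical, and the paper's actual decision system is a trained scoring model, not a fixed table.

```python
import math


def io_similarity(layer_in, layer_out):
    """Cosine similarity between a layer's input and output vectors."""
    dot = sum(a * b for a, b in zip(layer_in, layer_out))
    n_in = math.sqrt(sum(a * a for a in layer_in))
    n_out = math.sqrt(sum(b * b for b in layer_out))
    return dot / (n_in * n_out) if n_in and n_out else 0.0


# Hypothetical layer states, ordered from highest to lowest speedup.
# Both the names and the thresholds are illustrative assumptions.
STATES = [
    (0.999, "skip"),     # layer output nearly identical to input
    (0.99, "reduced"),   # assumed cheaper execution mode
]


def choose_state(similarity):
    """Pick the most aggressive state whose threshold is met."""
    for threshold, name in STATES:
        if similarity >= threshold:
            return name
    return "full"  # default: execute the layer normally
```

For example, `choose_state(io_similarity(x, y))` returns `"skip"` only when a layer leaves its input almost unchanged, and falls back to `"full"` otherwise.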