KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

TL;DR

KnapSpec is a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. The paper also provides the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate.

Abstract

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
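To make the knapsack framing concrete, here is a minimal sketch of a 0/1 knapsack dynamic program over layers. It is not the authors' algorithm (which the abstract describes as a parallel DP). The latency model in `layer_costs`, the per-layer `values` scores, and the budget are all hypothetical, chosen only to illustrate how a context-dependent attention cost can change which layers fit in the draft.

```python
from typing import List, Tuple


def layer_costs(kinds: List[str], n_ctx: int) -> List[int]:
    """Hypothetical integer latency units per layer (an assumption, not measured).

    Attention cost grows with context length; MLP cost is constant.
    """
    return [1 + n_ctx // 4096 if kind == "attn" else 2 for kind in kinds]


def select_layers(costs: List[int], values: List[float], budget: int) -> Tuple[float, List[int]]:
    """0/1 knapsack: choose layers to KEEP in the draft model.

    Maximizes the summed (hypothetical) value of kept layers subject to
    total latency <= budget. Returns (best value, sorted kept indices).
    """
    n = len(costs)
    # best[i][b] = best value using only the first i layers with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, value = costs[i - 1], values[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip layer i-1
            if cost <= b and best[i - 1][b - cost] + value > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + value  # option 2: keep it

    # Backtrack to recover which layers were kept.
    kept, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= costs[i - 1]
    kept.reverse()
    return best[n][budget], kept


# Illustrative inputs (hypothetical): two Transformer blocks, each split
# into an Attention layer and an MLP layer, with made-up value scores.
kinds = ["attn", "mlp", "attn", "mlp"]
values = [0.5, 0.4, 0.1, 0.3]

short_value, short_kept = select_layers(layer_costs(kinds, 1024), values, budget=4)
long_value, long_kept = select_layers(layer_costs(kinds, 16384), values, budget=4)
```

Under these toy numbers, the short-context run keeps both Attention layers, while the long-context run keeps only the MLP layers because each Attention layer alone exceeds the budget. This matches the qualitative trend the paper reports: more Attention layers are skipped as context grows.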

Paper Structure

This paper contains 33 sections, 2 theorems, 14 equations, 6 figures, 2 tables, 3 algorithms.

Key Result

Lemma 4.0

Given embeddings $w_1, \ldots, w_V \in \mathbb{R}^d$ and a fixed vector $x \in \mathbb{R}^d$, let $i^* := \mathop{\rm argmax}_{i} \langle w_i, x \rangle$ and define the margin $\xi(x) := \langle w_{i^*}, x \rangle - \max_{j \neq i^*} \langle w_j, x \rangle$. Then $\mathop{\rm argmax}_{i \in [V]} \langle w_i, x \rangle = \mathop{\rm argmax}_{i
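The two quantities defined in the lemma, the top-1 index $i^*$ and the margin $\xi(x)$, can be computed directly from the inner products. The sketch below uses plain Python lists; the function name `top1_margin` and the toy embeddings are assumptions for illustration, and the code covers only the definitions, not the lemma's conclusion.

```python
from typing import List, Tuple


def top1_margin(W: List[List[float]], x: List[float]) -> Tuple[int, float]:
    """Return (i_star, xi) for embeddings W (V rows of dimension d) and vector x.

    i_star is the index with the largest inner product <w_i, x>;
    xi is the gap between that score and the best score among the others.
    Requires at least two embeddings so that a runner-up exists.
    """
    if len(W) < 2:
        raise ValueError("need at least two embeddings to define a margin")
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    i_star = max(range(len(scores)), key=scores.__getitem__)
    runner_up = max(s for j, s in enumerate(scores) if j != i_star)
    return i_star, scores[i_star] - runner_up


# Toy example (hypothetical values): V = 3 embeddings in d = 2 dimensions.
W = [[1, 0], [0, 1], [-1, 0]]
x = [2, 1]
i_star, xi = top1_margin(W, x)
```

Here the inner products are 2, 1, and -2, so the top-1 index is 0 and the margin is 1. A larger margin means the top-1 prediction tolerates a larger change in the scores before it flips.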

Figures (6)

  • Figure 1: Overview of KnapSpec. (1) Self-Speculative Decoding defines a draft model as a sub-network of the target model. (2) Layer selection can be converted to a Knapsack Problem, and we search candidate sets via Dynamic Programming (DP) with various latency budgets (weights). (3) Then, the optimal configuration that maximizes Tokens-per-Time (TPT) is selected and used for the next speculation.
  • Figure 2: Pearson Correlation Coefficients (PCC) and $R^2$-value of TPT and acceptance rate against throughput. Best TPT shows a much closer correlation with actual performance.
  • Figure 3: Comparison of skipped layers across context lengths up to 32k. KnapSpec chooses more Attention layers to skip than MLP as the length $n$ increases, while SWIFT chooses uniform layers.
  • Figure 4: Throughput, memory and optimization overhead reduction (%) with various similarity thresholds $\tau$. We select $\tau=0.5$ to achieve the best balance between speed and memory.
  • Figure 5: Layer-wise skip probabilities aggregated over evaluation samples. The non-uniform distribution demonstrates the model-specific redundancies. While certain layers are frequently skipped to reduce computational cost, some layers consistently show low skip probabilities.
  • ...and 1 more figure
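The Tokens-per-Time (TPT) selection step described in Figure 1 can be sketched as follows. The paper's exact TPT definition does not appear in this excerpt, so the code assumes the standard speculative-decoding estimate: with per-token acceptance rate `alpha` (treated as independent across positions) and `gamma` drafted tokens, the expected number of tokens produced per verification step is $(1 - \alpha^{\gamma+1}) / (1 - \alpha)$. The candidate names, acceptance rates, and latencies are hypothetical.

```python
from typing import Dict, List


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens generated per draft-then-verify round.

    Standard estimate under an i.i.d. acceptance assumption; counts the
    accepted draft tokens plus the one token the target model always yields.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def tokens_per_time(alpha: float, gamma: int, t_draft: float, t_target: float) -> float:
    """Tokens per unit time: gamma draft steps plus one target verification pass."""
    return expected_tokens(alpha, gamma) / (gamma * t_draft + t_target)


def best_config(candidates: List[Dict], gamma: int, t_target: float) -> Dict:
    """Pick the candidate draft configuration with the highest estimated TPT."""
    return max(
        candidates,
        key=lambda c: tokens_per_time(c["alpha"], gamma, c["t_draft"], t_target),
    )


# Hypothetical candidates: A is faithful but slow, B is cheap but less accurate.
candidates = [
    {"name": "A", "alpha": 0.9, "t_draft": 0.8},
    {"name": "B", "alpha": 0.5, "t_draft": 0.2},
]
chosen = best_config(candidates, gamma=4, t_target=1.0)
```

With these toy numbers, configuration B is chosen even though A has the higher acceptance rate, because B's drafting time is much lower. This is consistent with the Figure 2 observation that TPT tracks actual throughput more closely than acceptance rate alone.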

Theorems & Definitions (3)

  • Lemma 4.0
  • Lemma 1.0
  • proof