KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

Seongjin Cha; Gyuwan Kim; Dongsu Han; Tao Yang; Insu Han

TL;DR

KnapSpec is a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. The paper also provides the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate.

Abstract

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
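The knapsack view described in the abstract can be illustrated with a minimal sketch. It treats each Attention or MLP layer as an item with an integer latency weight and a value, then picks the subset with the highest total value under a latency budget using a textbook 0/1 knapsack DP. Everything specific here is an assumption made for illustration: the linear latency model in `attention_latency_ms`, the quantization step in `to_int_weight`, the function names, and the per-layer values. The paper's actual scoring and its parallel DP are not reproduced; this version is sequential.

```python
from typing import List, Tuple


def attention_latency_ms(context_len: int, base_ms: float = 0.2,
                         per_token_ms: float = 0.0001) -> float:
    """Hypothetical latency model: attention cost grows linearly with context length."""
    return base_ms + per_token_ms * context_len


def to_int_weight(latency_ms: float, unit_ms: float = 0.1) -> int:
    """Quantize a latency to an integer number of units so it can index a DP table."""
    return max(1, round(latency_ms / unit_ms))


def knapsack_select(values: List[float], weights: List[int],
                    budget: int) -> Tuple[float, List[int]]:
    """Textbook 0/1 knapsack: maximize total value with total weight <= budget.

    Returns the best total value and the sorted indices of the chosen items.
    """
    k = len(values)
    # best[i][b] = best value achievable with the first i items and budget b
    best = [[0.0] * (budget + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        v, w = values[i - 1], weights[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # option 1: leave item i-1 out
            if w <= b and best[i - 1][b - w] + v > best[i][b]:
                best[i][b] = best[i - 1][b - w] + v  # option 2: take item i-1

    # Backtrack through the table to recover which items were taken.
    chosen: List[int] = []
    b = budget
    for i in range(k, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= weights[i - 1]
    chosen.reverse()
    return best[k][budget], chosen
```

For example, `knapsack_select([3.0, 4.0, 5.0, 6.0], [2, 3, 4, 5], 5)` returns `(7.0, [0, 1])`: two cheap items together beat the single most valuable one. In a layer-selection setting, the weights of Attention items would be recomputed as the context grows, so the chosen subset can change over the course of decoding.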

Paper Structure

This paper contains 33 sections, 2 theorems, 14 equations, 6 figures, 2 tables, and 3 algorithms.

Introduction
Related Work
Contributions
Preliminaries
Speculative Decoding
Self-Speculative Decoding with Layer Selection
Early-Exit SSD.
Layer-Skip SSD.
KnapSpec: Layer Selection for SSD as a Knapsack Problem
Motivation: Asymmetric Layer Latencies
Tokens-per-Time (TPT) Metric
Problem Formulation as a 0/1 Knapsack Problem
Efficient Optimization via Dynamic Programming
Integer Weight Normalization.
Recurrence Relation in DP.
...and 18 more sections

Key Result

Lemma 4.0

Given embeddings $w_1, \ldots, w_V \in \mathbb{R}^d$ and a fixed vector $x \in \mathbb{R}^d$, let $i^* := \mathop{\rm argmax}_{i} \langle w_i, x \rangle$ and define the margin $\xi(x) := \langle w_{i^*}, x \rangle - \max_{j \neq i^*} \langle w_j, x \rangle$. Then $\mathop{\rm argmax}_{i \in [V]} \langle w_i, x \rangle = \mathop{\rm argmax}_{i
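The two quantities the lemma defines, the top-scoring index $i^*$ and the margin $\xi(x)$, are easy to compute directly. The sketch below does only that, for a small hand-made embedding table; the function name `top_and_margin` and the example vectors are hypothetical, and nothing here reconstructs the conclusion of the lemma.

```python
from typing import List, Tuple


def top_and_margin(W: List[List[float]], x: List[float]) -> Tuple[int, float]:
    """Return (i_star, margin) for rows w_i of W and a vector x.

    i_star is the index with the largest inner product <w_i, x>;
    margin is that score minus the best score among all other rows.
    Requires at least two rows in W.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    i_star = max(range(len(scores)), key=scores.__getitem__)
    runner_up = max(s for j, s in enumerate(scores) if j != i_star)
    return i_star, scores[i_star] - runner_up


# Three 2-D "embeddings" and one hidden state (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [2.0, 0.5]
i_star, margin = top_and_margin(W, x)  # scores are [2.0, 0.5, -2.0]
```

Here `i_star` is `0` and `margin` is `1.5`. A larger margin means the top token is separated from the runner-up by a wider gap in score.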

Figures (6)

Figure 1: Overview of KnapSpec. (1) Self-Speculative Decoding defines a draft model as a sub-network of the target model. (2) Layer selection can be converted to a Knapsack Problem, and we search candidate sets via Dynamic Programming (DP) with various latency budgets (weights). (3) Then, the optimal configuration that maximizes Tokens-per-Time (TPT) is selected and used for the next speculation.
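Step (3) of the overview, choosing the candidate that maximizes Tokens-per-Time, can be sketched as follows. The paper's exact TPT definition is not given in this summary, so the sketch assumes the standard speculative-decoding estimate: with a per-token acceptance probability `accept_rate` treated as independent across positions and a draft length `draft_len`, the expected number of tokens produced per round is (1 - accept_rate^(draft_len+1)) / (1 - accept_rate), and one round costs `draft_len` draft passes plus one verification pass. The `Candidate` class, field names, and latency numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    accept_rate: float  # assumed per-token acceptance probability, in [0, 1]
    draft_ms: float     # assumed latency of one draft forward pass


def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per round under an i.i.d. acceptance assumption."""
    if accept_rate >= 1.0:
        return float(draft_len + 1)  # every draft token accepted, plus one bonus token
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def tokens_per_time(c: Candidate, draft_len: int, verify_ms: float) -> float:
    """Expected tokens per round divided by the time one round takes."""
    round_ms = draft_len * c.draft_ms + verify_ms
    return expected_tokens(c.accept_rate, draft_len) / round_ms


def pick_best(cands: List[Candidate], draft_len: int, verify_ms: float) -> Candidate:
    """Select the candidate draft configuration with the highest estimated TPT."""
    return max(cands, key=lambda c: tokens_per_time(c, draft_len, verify_ms))


candidates = [
    Candidate("A", accept_rate=0.9, draft_ms=8.0),  # faithful but slow draft
    Candidate("B", accept_rate=0.6, draft_ms=3.0),  # less faithful but fast draft
]
best = pick_best(candidates, draft_len=4, verify_ms=10.0)
```

With these illustrative numbers, `best.name` is `"B"`: the draft with the lower acceptance rate still wins because each round is much cheaper. This is the kind of trade-off a throughput-based objective captures and an acceptance-rate-only objective misses.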
Figure 2: Pearson Correlation Coefficients (PCC) and $R^2$-value of TPT and acceptance rate against throughput. Best TPT shows a much closer correlation with actual performance.
Figure 3: Comparison of skipped layers across context lengths up to 32k. KnapSpec chooses more Attention layers to skip than MLP as the length $n$ increases, while SWIFT chooses uniform layers.
Figure 4: Throughput, memory and optimization overhead reduction (%) with various similarity thresholds $\tau$. We select $\tau=0.5$ to achieve the best balance between speed and memory.
Figure 5: Layer-wise skip probabilities aggregated over evaluation samples. The non-uniform distribution demonstrates the model-specific redundancies. While certain layers are frequently skipped to reduce computational cost, some layers consistently show low skip probabilities.
...and 1 more figure

Theorems & Definitions (3)

Lemma 4.0
Lemma 1.0
proof
