SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration

Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong

TL;DR

This work tackles the latency of autoregressive LLM decoding by proposing SDFP, a training-free, plug-and-play framework that builds a lightweight draft model via Fisher Information Trace (FIT)-based layer pruning. The pruned draft proposes tokens that the full model then verifies through speculative decoding, preserving the exact output distribution without retraining. The approach demonstrates 1.32×–1.5× end-to-end speedups across diverse tasks and model sizes, with minimal offline overhead and no task-specific optimization. The practical impact is accelerated, deployment-friendly LLM inference suitable for real-time multimedia applications without sacrificing fidelity.

Abstract

Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32×–1.5× decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
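
The claim that verification leaves the target model's output distribution unchanged can be illustrated with the standard speculative sampling rule: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q). The sketch below is a toy illustration of that general rule on a three-token vocabulary, not the paper's implementation; the function names and the distributions `p` (target) and `q` (draft) are hypothetical.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [r / total for r in raw]


def verify_token(token, p, q, rng):
    """Accept the drafted token with probability min(1, p[token] / q[token]).

    Returns (token, accepted). On rejection the returned token is a
    replacement drawn from the residual distribution.
    """
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return token, True
    residual = residual_distribution(p, q)
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return replacement, False


def speculative_sample(p, q, rng):
    """Draft one token from q (toy draft model), then verify it against p."""
    drafted = rng.choices(range(len(q)), weights=q, k=1)[0]
    return verify_token(drafted, p, q, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # hypothetical target-model distribution
    q = [0.3, 0.3, 0.4]  # hypothetical draft-model distribution
    n = 20000
    counts = [0] * len(p)
    for _ in range(n):
        token, _accepted = speculative_sample(p, q, rng)
        counts[token] += 1
    # The empirical frequencies should approximate p, not q.
    print([round(c / n, 3) for c in counts])
```

Even though the draft distribution here differs substantially from the target, the emitted tokens follow the target distribution; a poorer draft only lowers the acceptance rate and therefore the speedup.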

Paper Structure (15 sections, 11 equations, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: Overview of the SDFP framework. SDFP integrates FIT-based pruning and speculative decoding to enable training-free and plug-and-play acceleration of LLMs. In Stage 1, FIT scores, computed using Fisher Information and Parameter Perturbation Variance, are used to rank layer sensitivity and prune redundant layers, resulting in a compact draft model. In Stage 2, the pruned model acts as the draft to generate speculative tokens, which are then verified by the base model. Accepted tokens are committed to the output, while rejected ones are regenerated, achieving efficient decoding without retraining.
  • Figure 2: Comparison between previous optimization-based acceleration methods and our direct inference acceleration approach. Previous methods incur a substantial upfront optimization phase over the first $mN$ tokens before achieving limited decoding acceleration. In contrast, our method applies FIT-based pruning and speculative decoding directly at inference time, enabling end-to-end acceleration across the entire generation process without any offline optimization overhead.
  • Figure 3: Layer-wise FIT sensitivity heatmaps computed on WikiText2 for (a) LLaMA-2-13B and (b) LLaMA-2-7B. Both models exhibit non-uniform sensitivity distributions across transformer layers, indicating intrinsic layer-wise redundancy and motivating FIT-guided layer pruning in SDFP.
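
The Stage 1 ranking described in the Figure 1 caption can be sketched roughly as follows. This section does not give the exact FIT formula, so the sketch assumes a per-layer score equal to the trace of the empirical Fisher information (the mean squared gradient norm) multiplied by a per-layer perturbation variance. The function names, the toy gradients, and the variance values are all hypothetical stand-ins; in practice the gradients would come from a calibration set such as the WikiText2 data mentioned in Figure 3.

```python
def empirical_fisher_trace(grad_samples):
    """Mean over samples of the squared gradient norm for one layer.

    This equals the trace of the empirical Fisher information for that layer.
    """
    total = sum(sum(g * g for g in grads) for grads in grad_samples)
    return total / len(grad_samples)


def fit_scores(layer_grads, perturbation_var):
    """Assumed FIT-style score: Fisher trace times perturbation variance."""
    return {
        name: empirical_fisher_trace(samples) * perturbation_var[name]
        for name, samples in layer_grads.items()
    }


def select_layers_to_prune(scores, num_to_prune):
    """Return the num_to_prune layers with the lowest sensitivity scores."""
    ranked = sorted(scores, key=scores.get)
    return ranked[:num_to_prune]


if __name__ == "__main__":
    # Toy per-sample gradients for four layers (two samples, two parameters each).
    layer_grads = {
        "layer_0": [[0.9, -1.1], [1.0, 0.8]],
        "layer_1": [[0.1, 0.0], [-0.1, 0.2]],
        "layer_2": [[0.5, 0.4], [-0.6, 0.3]],
        "layer_3": [[0.05, 0.02], [0.01, -0.03]],
    }
    # Assumed uniform perturbation variance, purely for illustration.
    perturbation_var = {name: 1.0 for name in layer_grads}

    scores = fit_scores(layer_grads, perturbation_var)
    pruned = select_layers_to_prune(scores, num_to_prune=2)
    kept = [name for name in layer_grads if name not in pruned]
    print("pruned:", pruned)
    print("draft model layers:", kept)
```

Because the draft is obtained by dropping layers from the target model, it shares the target's tokenizer and vocabulary, which is what lets it plug into standard speculative verification without a separately maintained draft.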