Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Luxi Lin; Zhihang Lin; Zhanpeng Zeng; Yuhao Chen; Qingyu Zhang; Jixiang Luo; Xuelong Li; Rongrong Ji

TL;DR

Efficient Draft Adaptation (EDA) is a parameter- and data-efficient framework for adapting draft models that effectively restores speculative decoding performance on fine-tuned target models.

Abstract

Speculative decoding accelerates LLM inference, but its performance degrades when target models are fine-tuned for specific domains. A naive solution is to retrain the draft model for every target model, which is costly and inefficient. To address this, we introduce Efficient Draft Adaptation (EDA), a parameter- and data-efficient framework for adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that uses shared and private components to model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that uses the fine-tuned target model to regenerate training data, improving the alignment between training and speculative decoding and yielding a higher average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn-Lucy/Efficient-Draft-Adaptation.
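To make the abstract's central metric concrete, the following is a minimal sketch of how an average acceptance length could be computed under greedy verification. It assumes that each draft-verify cycle emits the accepted draft prefix plus one token supplied by the target model; the paper's exact definition is given in its Preliminaries and may differ. The function names and the toy token IDs are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence, Tuple


def tokens_per_cycle(draft: Sequence[int], target: Sequence[int]) -> int:
    """Tokens emitted in one draft-verify cycle under greedy verification.

    `draft` holds the k tokens proposed by the draft model. `target` holds the
    target model's own greedy tokens at the same k positions plus one extra
    position (assumption: the target always contributes one token per cycle).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first mismatch ends the accepted prefix
        accepted += 1
    return accepted + 1  # accepted prefix plus the target's own token


def average_acceptance_length(cycles: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Mean number of tokens emitted per cycle over a list of (draft, target) pairs."""
    if not cycles:
        raise ValueError("need at least one cycle")
    return sum(tokens_per_cycle(d, t) for d, t in cycles) / len(cycles)


# Toy data (hypothetical): a well-aligned draft versus a mismatched one.
aligned = [([1, 2, 3, 4], [1, 2, 3, 4, 5]), ([7, 8, 9, 9], [7, 8, 9, 1, 2])]
mismatched = [([1, 2, 3, 4], [1, 9, 3, 4, 5]), ([7, 8, 9, 9], [6, 8, 9, 1, 2])]

tau_aligned = average_acceptance_length(aligned)        # (5 + 4) / 2
tau_mismatched = average_acceptance_length(mismatched)  # (2 + 1) / 2
```

In this toy setting, a draft that tracks the target yields a longer accepted prefix per cycle, which is the quantity the abstract says drops when a draft trained for a base model is paired with a fine-tuned target.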

Paper Structure (27 sections, 15 equations, 4 figures, 4 tables)

Introduction
Related Work
Speculative Decoding
Parameter-Efficient Fine-Tuning
Preliminaries
Speculative Decoding
Average Acceptance Length
Method
Modeling Shared and Private Output Distributions Separately
Shared–Private Gated Draft Architecture
Expert Parameterization and Draft Distribution
Draft Model Initialization and Adaptation
Matching Training and Drafting Objective
Adapting Draft Model in a Data-Efficient Manner
Experiments
...and 12 more sections

Figures (4)

Figure 1: Comparison of average acceptance length on GSM8K under different draft-target pairings. We refer to the target models as $T_{Base}$ (Qwen2.5-7B) and $T_{Math}$ (Qwen2.5-Math-7B). The draft model $D_{Base}$ trained for $T_{Base}$ achieves a high average acceptance length when paired with it ($D_{Base} \rightarrow T_{Base}$), but suffers a substantial acceptance drop when paired with $T_{Math}$ ($D_{Base} \rightarrow T_{Math}$). Our EDA framework restores performance through efficient, lightweight adaptation.
Figure 2: The EDA framework for efficient draft model adaptation, combining shared–private draft decomposition, domain-specific self-generation, and representation-shift–based data selection to restore speculative decoding performance with minimal training cost.
Figure 3: Comparison of different data selection strategies under varying data budgets. Draft adaptation from Qwen2.5-7B to Qwen2.5-Math-7B, measured on 4$\times$ NVIDIA H200 GPUs; $\tau$ is evaluated on GSM8K under this setup.
Figure 4: Qualitative example comparing direct transfer and EDA-adapted drafts on Qwen2.5-Coder-7B. Red tokens indicate incorrect predictions, while black tokens indicate correct predictions.
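The representation-shift-based data selection named in the Figure 2 caption can be illustrated with a small sketch. This summary does not state how the shift is scored or how samples are chosen, so the version below rests on two loudly labeled assumptions: the shift of a sample is taken as one minus the cosine similarity between its representation under the base target model and under the fine-tuned target model, and the samples with the largest shift are kept up to a fixed budget. All names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_shift(base_vec: Sequence[float], tuned_vec: Sequence[float]) -> float:
    """Assumed shift score: 1 - cosine similarity between two representations."""
    dot = sum(a * b for a, b in zip(base_vec, tuned_vec))
    norm = math.sqrt(sum(a * a for a in base_vec)) * math.sqrt(sum(b * b for b in tuned_vec))
    if norm == 0.0:
        return 0.0  # treat a zero vector as "no measurable shift"
    return 1.0 - dot / norm


def select_by_shift(
    base_reprs: Sequence[Sequence[float]],
    tuned_reprs: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Return the indices of the `budget` samples with the largest assumed shift."""
    if len(base_reprs) != len(tuned_reprs):
        raise ValueError("need one base and one tuned representation per sample")
    scores = [cosine_shift(b, t) for b, t in zip(base_reprs, tuned_reprs)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # report the chosen indices in dataset order


# Toy representations (hypothetical): sample 0 is unchanged by fine-tuning,
# sample 1 rotates by 90 degrees, and sample 2 rotates by 45 degrees.
base = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
tuned = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

chosen = select_by_shift(base, tuned, budget=2)
```

With a budget of two, the sketch keeps samples 1 and 2 and drops the sample whose representation did not move, which matches the intuition of spending the training budget where the fine-tuned target differs most from the base model.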
