Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji

TL;DR

Efficient Draft Adaptation (EDA) is a parameter- and data-efficient framework for adapting draft models that effectively restores speculative decoding performance on fine-tuned target models.

Abstract

Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain the draft model for every target model, which is costly and inefficient. To address this, we introduce Efficient Draft Adaptation (EDA), a parameter- and data-efficient framework for adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that uses shared and private components to model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that uses the fine-tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding and leading to a higher average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn-Lucy/Efficient-Draft-Adaptation.

Paper Structure (27 sections, 15 equations, 4 figures, 4 tables)

This paper contains 27 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of average acceptance length on GSM8K under different draft-target pairings. We refer to the target models as $T_{Base}$ (Qwen2.5-7B) and $T_{Math}$ (Qwen2.5-Math-7B). $D_{Base}$, the draft model of $T_{Base}$, achieves a high average acceptance length when paired with $T_{Base}$ ($D_{Base} \rightarrow T_{Base}$) but suffers a substantial drop in acceptance length when paired with $T_{Math}$ ($D_{Base} \rightarrow T_{Math}$). Our EDA framework restores performance through efficient, lightweight adaptation.
  • Figure 2: The EDA framework for efficient draft model adaptation, combining shared–private draft decomposition, domain-specific self-generation, and representation-shift–based data selection to restore speculative decoding performance with minimal training cost.
  • Figure 3: Comparison of different data selection strategies under varying data budgets. Draft adaptation from Qwen2.5-7B to Qwen2.5-Math-7B, measured on 4$\times$ NVIDIA H200 GPUs; $\tau$ is evaluated on GSM8K under this setup.
  • Figure 4: Qualitative example comparing direct transfer and EDA-adapted drafts on Qwen2.5-Coder-7B. Red tokens indicate incorrect predictions, while black tokens indicate correct predictions.
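
The figures above report average acceptance length for different draft-target pairings. As a rough illustration of what that metric measures, the sketch below computes it for greedy verification. It is not taken from the EDA codebase: the function names `accepted_length` and `average_acceptance_length` are hypothetical, and it assumes the common convention that each draft-verify cycle counts the matched draft tokens plus one token supplied by the target model (a correction at the first mismatch, or a bonus token when every drafted token matches). Other works may count only the matched draft tokens.

```python
from typing import List, Sequence, Tuple


def accepted_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Tokens produced in one draft-verify cycle under greedy verification.

    Assumption: target_tokens[i] is the target model's greedy choice at
    position i given the prefix plus draft_tokens[:i], so it has exactly one
    more entry than draft_tokens (the extra entry is the bonus token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    matched = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # the first mismatch ends acceptance for this cycle
        matched += 1
    # The target always contributes one token: a correction or a bonus token.
    return matched + 1


def average_acceptance_length(
    cycles: List[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Mean number of tokens produced per draft-verify cycle."""
    if not cycles:
        raise ValueError("at least one cycle is required")
    lengths = [accepted_length(draft, target) for draft, target in cycles]
    return sum(lengths) / len(lengths)


# Toy token ids, invented purely for illustration.
well_aligned = ([5, 7, 9, 2], [5, 7, 9, 2, 4])  # every drafted token matches
misaligned = ([5, 7, 1, 2], [5, 7, 9, 3, 4])    # mismatch at the third token
print(average_acceptance_length([well_aligned, misaligned]))
```

Under this convention, a draft model that is poorly aligned with a fine-tuned target mismatches earlier in each cycle, which lowers the average and reduces the speedup from speculative decoding.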