Training Domain Draft Models for Speculative Decoding: Best Practices and Insights

Fenglu Hong, Ravi Raju, Jonathan Lingjie Li, Bo Li, Urmish Thakker, Avinash Ravichandran, Swayambhoo Jain, Changran Hu

TL;DR

The paper tackles the problem of domain shift reducing the effectiveness of speculative decoding when targeting domain-specific LLMs. It systematically compares white-box and black-box knowledge distillation, across three data-access scenarios, and evaluates performance on Function Calling, Biology, and Chinese domains. Key findings show that offline distillation with forward KL losses and white-box supervision consistently outperform online learning and black-box alternatives, while synthetic Magpie data can closely approximate in-domain training. These insights yield practical guidelines for building domain-adapted draft models to accelerate inference while preserving domain-specific performance.

Abstract

Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
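
To make the white-box versus black-box distinction concrete, the sketch below contrasts the two training signals at a single token position. It assumes that white-box distillation minimizes a forward KL divergence between the target and draft next-token distributions (which requires the target's logits), and that black-box distillation reduces to a negative log-likelihood on the token the target actually generated (which requires only the target's text). The function names and toy logits are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(target_logits, draft_logits):
    """Forward KL(p_target || q_draft) at one position.

    White-box signal (assumption): needs the target model's full logits.
    """
    log_p = log_softmax(target_logits)
    log_q = log_softmax(draft_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def black_box_nll(target_token_id, draft_logits):
    """Negative log-likelihood of the target's generated token under the draft.

    Black-box signal (assumption): needs only the target model's output token.
    """
    return -log_softmax(draft_logits)[target_token_id]


# Toy vocabulary of three tokens; the values are illustrative only.
target_logits = [2.0, 0.5, -1.0]
draft_logits = [1.0, 1.0, -0.5]

kl = forward_kl(target_logits, draft_logits)
nll = black_box_nll(0, draft_logits)
```

The forward KL is zero only when the draft distribution matches the target's at that position, whereas the black-box loss sees a single generated token and so carries less information per position. This is one plausible reading of why white-box supervision helps, not a claim about the authors' implementation.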

Paper Structure

This paper contains 21 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Three data accessibility scenarios for domain draft model training. Scenario I assumes access to historical user queries and trains the draft model with distillation losses on the target model's generations. Scenarios II and III assume no access to user queries; training instead uses either collected domain queries (II) or synthetic queries generated by the target model (III).
  • Figure 2: Average acceptance rates for different methods. All methods except SFT - Magpie train with in-domain data, where a domain-specific dataset is split into training and test sets to mimic real user queries (Scenario I). The SFT - Magpie method trains with Magpie synthetic data (Scenario III). More details are given in the appendix (main results).
  • Figure 3: Performance scales with dataset size. In addition, as the training data grows, the offline KL approach gains an increasing advantage over online KL in the Biology and Chinese domains.
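
The acceptance rate that the figures report can be illustrated with a small sketch. The paper's exact definition is not reproduced on this page, so the code assumes one common formulation: under greedy verification, a drafted token is accepted if it matches the target model's token at the same position, acceptance stops at the first mismatch, and the rate is accepted tokens divided by proposed tokens. All names and token IDs are hypothetical.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count drafted tokens accepted before the first mismatch (greedy verification)."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


def acceptance_rate(rounds):
    """Fraction of proposed draft tokens accepted across speculation rounds.

    `rounds` is a list of (draft_tokens, target_tokens) pairs, one per round.
    """
    proposed = sum(len(draft) for draft, _ in rounds)
    accepted = sum(accepted_prefix_length(draft, target) for draft, target in rounds)
    return accepted / proposed if proposed else 0.0


# Two illustrative rounds: the first diverges at the third token,
# the second is accepted in full.
rounds = [
    ([1, 2, 3, 4], [1, 2, 9, 4]),
    ([5, 6], [5, 6]),
]
rate = acceptance_rate(rounds)  # 4 accepted out of 6 proposed
```

A draft model that is better aligned with the domain-specific target yields longer accepted prefixes per round, which is the quantity that distillation aims to increase.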