OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang

TL;DR

OmniDraft introduces a universal on-device drafter capable of pairing with diverse target LLMs through cross-vocabulary speculative decoding. It combines an online n-gram cache to translate between vocabularies, a hybrid online distillation loss to align draft and target distributions, and adaptive drafting to balance latency and throughput. The approach yields consistent improvements in acceptance rate and speedup across multiple tasks and target families, with manageable on-device cache footprints. This work enables a single lightweight model to support multiple targets, reducing deployment overhead and enabling dynamic personalization in edge environments. The combination of cross-vocabulary translation, online alignment, and adaptive drafting provides a practical path toward flexible, efficient on-device LLM inference at scale.

Abstract

Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

Paper Structure

This paper contains 44 sections, 4 equations, 9 figures, 14 tables, 4 algorithms.

Figures (9)

  • Figure 1: Overview of the OmniDraft framework. During cross-vocabulary speculative decoding, the drafter (Llama-68M) generates multiple tokens $d_i$ with corresponding distributions $q_i$. The cross-vocabulary translator then converts the drafter tokens into tokens in the vocabulary of the target model (Llama3-8B). In this example, tokens $d_0$ ('Snow') and $d_4$ ('is') are directly mapped to target tokens $t_0$ and $t_2$, while tokens $d_1$ ('f'), $d_2$ ('la') and $d_3$ ('ke') are merged into a single target token $t_1$ ('flake') because the n-gram cache contains a matching mapping. The translated proposal $t_i$, along with the combined probabilities $q'_i$, is verified by the target model: $t_0$ and $t_1$ are accepted, while $t_2$ is rejected and replaced by $t'_2$. The target's output tokens and their probabilities $p_i$ are translated back into drafter tokens and sent to the drafter for the next round of drafting. The n-gram cache is updated by inserting a previously unseen item ('st','amps'->'stamps'). Meanwhile, the accepted and corrected tokens from the target model are used to align the drafter through online cross-vocabulary distillation.
  • Figure 2: Cross-vocabulary SpD online distillation on Llama-68M with Qwen2-7B as target
  • Figure 3: Online Adaptive Drafting training plot of Llama-68M vs Vicuna-7B
  • Figure 4: Cross-vocabulary SpD online distillation on Llama-68M with Llama3-8B as target
  • Figure 5: Dataset Drift tracking during training by switching the dataset from GSM8K to MBPP+HumanEval at Step 1000 for Llama-60M vs Qwen-2.5-7B model
  • ...and 4 more figures
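
The translation step described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `translate_draft` and `update_cache`, the greedy longest-match lookup, the `max_n` limit, and the choice to combine probabilities by multiplying the drafter's per-token probabilities are all assumptions made for illustration. Tokens with no cache entry are passed through unchanged, which assumes they exist in the target vocabulary.

```python
from typing import Dict, List, Tuple

# Hypothetical cache type: a tuple of drafter tokens maps to one target token.
NGramCache = Dict[Tuple[str, ...], str]


def translate_draft(
    draft_tokens: List[str],
    draft_probs: List[float],
    cache: NGramCache,
    max_n: int = 4,
) -> List[Tuple[str, float]]:
    """Translate drafter tokens into target tokens using an n-gram cache.

    Greedy longest match: at each position, try the longest cached n-gram
    first. A merged token gets the product of its constituents'
    probabilities (an assumed combination rule, not taken from the paper).
    """
    translated: List[Tuple[str, float]] = []
    i = 0
    while i < len(draft_tokens):
        longest = min(max_n, len(draft_tokens) - i)
        for n in range(longest, 1, -1):
            key = tuple(draft_tokens[i:i + n])
            if key in cache:
                prob = 1.0
                for p in draft_probs[i:i + n]:
                    prob *= p
                translated.append((cache[key], prob))
                i += n
                break
        else:
            # No cached n-gram starts here: map the token directly.
            translated.append((draft_tokens[i], draft_probs[i]))
            i += 1
    return translated


def update_cache(cache: NGramCache, draft_pieces: List[str], target_token: str) -> None:
    """Insert a previously unseen mapping, e.g. ('st', 'amps') -> 'stamps'."""
    cache.setdefault(tuple(draft_pieces), target_token)


if __name__ == "__main__":
    cache: NGramCache = {("f", "la", "ke"): "flake"}
    proposal = translate_draft(
        ["Snow", "f", "la", "ke", "is"],
        [0.9, 0.5, 0.8, 0.5, 0.7],
        cache,
    )
    print([token for token, _ in proposal])  # ['Snow', 'flake', 'is']
    update_cache(cache, ["st", "amps"], "stamps")
```

The example input mirrors the caption: 'Snow' and 'is' map directly, while 'f', 'la', 'ke' collapse into 'flake'. Verification by the target model and the reverse translation back to drafter tokens are omitted from this sketch.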