
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun

TL;DR

We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially improve inference speed compared to previous methods.

Abstract

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for an assistant model in speculative decoding, which drafts future tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially improve inference speed compared to previous methods. We validate these models across various languages in terms of inference time, out-of-domain speedup, and GPT-4o evaluation.
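The draft-then-verify loop the abstract describes can be sketched in a few lines. The sketch below uses hypothetical toy next-token functions (`draft_next`, `target_next` are stand-ins, not the paper's Vicuna models) and greedy verification: the drafter proposes `k` tokens, the target checks them in one pass, accepts the longest matching prefix, and substitutes its own token at the first mismatch.

```python
def draft_next(ctx):
    # Hypothetical cheap drafter: cycles tokens 0..9.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Hypothetical target model: mostly agrees with the drafter,
    # but jumps to 9 after a 4, so some drafts get rejected.
    return 9 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def speculative_decode(ctx, steps, k=4):
    """Greedy speculative decoding with a draft/verify loop."""
    out = list(ctx)
    while len(out) < len(ctx) + steps:
        # 1) Drafter proposes k tokens autoregressively.
        drafts, tmp = [], list(out)
        for _ in range(k):
            t = draft_next(tmp)
            drafts.append(t)
            tmp.append(t)
        # 2) Target verifies: accept the matching prefix; on the first
        #    mismatch, emit the target's own token and stop this round.
        accepted, tmp = [], list(out)
        for t in drafts:
            want = target_next(tmp)
            if t == want:
                accepted.append(t)
                tmp.append(t)
            else:
                accepted.append(want)  # target's correction
                break
        else:
            # All drafts accepted: the verification pass yields one
            # extra target token for free.
            accepted.append(target_next(tmp))
        out.extend(accepted)
    return out[: len(ctx) + steps]
```

The output is identical to decoding with the target alone; the speedup comes from accepting several drafted tokens per target forward pass. A higher acceptance rate (the paper's goal with specialized drafters) means more tokens per verification step.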

Paper Structure

This paper contains 43 sections, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Speedup ratio relative to standard autoregressive greedy decoding on various multilingual datasets. The target model is Vicuna 7B v1.3 and the drafter is Vicuna 68M. Speculative greedy sampling is implemented with the Vicuna 68M drafter (green) and our specialized drafter (pretrain-and-finetune) (red).
  • Figure 2: Speedup comparison of various speculative decoding methods on the WMT16 De-En dataset with greedy settings ($T$=$0.0$) across various hardware. The target model is Vicuna-7B.
  • Figure 3: Speedup comparison across categories covering multi-turn conversation (MT-Bench), math reasoning (GSM8K), and translation (WMT16 De-En). The target model is Vicuna-7B with speculative greedy sampling.
  • Figure 4: Speedup with speculative greedy sampling on the WMT16 De-En dataset as the number of finetuning (F) training tokens varies, displayed on a logarithmic x-axis. 'P-F' represents our strategy, and 'F' involves training solely on De-En without the pretraining step (P).
  • Figure 5: Speedup with speculative greedy sampling on various out-of-domain datasets, with the drafters for 'Ours (P-F)' and 'F' trained on the WMT16 De-En dataset.
  • ...and 3 more figures