Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching

Benjamin Minixhofer, Ivan Vulić, Edoardo Maria Ponti

TL;DR

Existing distillation methods require the teacher and student to use similar tokenizers, which restricts the teacher–student pairs that can be used. The paper introduces Approximate Likelihood Matching (ALM), a principled cross-tokenizer distillation objective that aligns chunk-level teacher and student likelihoods via a binarised $f$-divergence, aided by outcome chunk debiasing and optional hidden-state distillation. ALM outperforms prior cross-tokenizer distillation methods across three use cases. First, treating tokenizer transfer as self-distillation enables subword→byte transfer, and transferring different models to a shared tokenizer enables ensembling. Second, a large maths-specialised teacher is distilled into a small general-purpose student with a different tokenizer. Third, ALM is used to train hypernetworks for zero-shot tokenizer transfer. The approach substantially broadens the space of feasible teacher–student pairs and allows more flexible interaction among LLMs, with public code available for replication.

Abstract

Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods require similar tokenizers between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a principled cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable effective distillation across fundamentally different tokenizers, while also substantially outperforming prior methods in all other cases. We verify the efficacy of our method on three distinct use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers, including rapid transfer of subword models to the byte-level. Transferring different models to the same tokenizer also enables ensembling to boost performance. Secondly, we distil a large maths-specialised LLM into a small general-purpose model with a different tokenizer, achieving competitive maths problem-solving performance. Thirdly, we use our method to train state-of-the-art embedding prediction hypernetworks for training-free tokenizer transfer. Our results unlock an expanded range of teacher-student pairs for distillation, enabling new ways to adapt and enhance interaction between LLMs.

Paper Structure

This paper contains 29 sections, 15 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: We propose a cross-tokenizer distillation method which identifies comparable chunks of tokens, then minimizes the differences between their likelihoods (cf. the method section).
  • Figure 2: Outcome chunk debiasing removes tokenization bias. For example, the low probability of the subword token $\texttt{\_Wor}$ would be matched to the high-probability byte sequence $\{\texttt{\_},\texttt{W},\texttt{o},\texttt{r}\}$ in naive subword $\rightarrow$ byte transfer. We can debias by multiplying by the marginal probability of a pretoken-boundary byte occurring after the chunk. In this example, $\{\texttt{\textbackslash{}n},\texttt{!}\}\subseteq \mathcal{B}$ and $\texttt{s} \notin \mathcal{B}$, $\texttt{l} \notin \mathcal{B}$ where $\mathcal{B}$ is the set of shared pretoken-boundary bytes across the teacher and the student.
  • Figure 3: Efficiency and task performance metrics of cross-tokenizer distillation methods, measured via worst-case performance across transfer of Gemma2 to Qwen2 and byte-level tokenizers. SFT denotes the required FLOPs and memory as well as the task performance of the SFT baseline.
  • Figure 4: The KL-Divergence gradients $\frac{\delta f_{\text{KL}}(p^{1/\tau}\|q^{1/\tau}) + \delta f_{\text{KL}}(1 - p^{1/\tau}\|1 - q^{1/\tau})}{\delta \log q}$ over $\tau$.
  • Figure 5: The Total Variation Distance gradients $\frac{\delta f_{\text{TVD}}(p^{1/\tau}\|q^{1/\tau}) + \delta f_{\text{TVD}}(1 - p^{1/\tau}\|1 - q^{1/\tau})}{\delta \log q}$ over $\tau$.
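
The binarised, temperature-scaled divergence that appears in the captions of Figures 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The pointwise forms $f_{\text{KL}}(p\|q) = p \log(p/q)$ and $f_{\text{TVD}}(p\|q) = \frac{1}{2}|p - q|$ are assumed here, the function names are hypothetical, and a chunk likelihood is taken to be the product of the probabilities of the tokens in the chunk.

```python
import math


def chunk_log_likelihood(token_log_probs):
    """Log-likelihood of a chunk, assumed to be the sum of its token log-probs."""
    return sum(token_log_probs)


def f_kl(p, q):
    """Assumed pointwise KL term p * log(p / q), with the convention 0 * log 0 = 0."""
    if p == 0.0:
        return 0.0
    return p * math.log(p / q)


def f_tvd(p, q):
    """Assumed pointwise total-variation term 0.5 * |p - q|."""
    return 0.5 * abs(p - q)


def binarised_divergence(f, teacher_log_p, student_log_q, tau=1.0):
    """f(p^(1/tau) || q^(1/tau)) + f(1 - p^(1/tau) || 1 - q^(1/tau)).

    p and q are the teacher and student likelihoods of comparable chunks.
    Both are assumed to lie strictly between 0 and 1.
    """
    p = math.exp(teacher_log_p / tau)
    q = math.exp(student_log_q / tau)
    return f(p, q) + f(1.0 - p, 1.0 - q)


# Hypothetical numbers: one teacher subword token versus four student bytes.
teacher_chunk = chunk_log_likelihood([math.log(0.2)])
student_chunk = chunk_log_likelihood([math.log(0.9), math.log(0.8),
                                      math.log(0.9), math.log(0.7)])
for tau in (1.0, 2.0, 10.0):
    print(tau,
          binarised_divergence(f_kl, teacher_chunk, student_chunk, tau),
          binarised_divergence(f_tvd, teacher_chunk, student_chunk, tau))
```

Working with chunk log-likelihoods means only a single scalar per chunk has to be compared, so the teacher and student vocabularies never need to be aligned token by token. The divergence is zero whenever the two chunk likelihoods agree, whatever the value of $\tau$.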