Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
Benjamin Minixhofer, Ivan Vulić, Edoardo Maria Ponti
TL;DR
Distillation has so far required the teacher and student to use similar tokenizers, which limits the set of feasible teacher–student pairs. The paper introduces Approximate Likelihood Matching (ALM), a principled cross-tokenizer distillation objective that aligns chunk-level teacher and student likelihoods via a binarised $f$-divergence, aided by outcome chunk debiasing and optional hidden-state distillation. ALM outperforms prior cross-tokenizer distillation methods and is evaluated on three use cases: tokenizer transfer framed as self-distillation (including subword-to-byte transfer, and ensembling of models moved to a shared tokenizer), distilling a large maths-specialised model into a small general-purpose model with a different tokenizer, and training hypernetworks for zero-shot tokenizer transfer. The approach substantially broadens the space of feasible teacher–student pairs, and the code is publicly available for replication.
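To make the idea of chunk-level likelihood matching concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that chunks are the spans whose end offsets coincide in both tokenizations, that a chunk's log-likelihood is the sum of its token log-probabilities, and that the binarised divergence is a KL divergence between Bernoulli distributions built from the chunk probabilities (KL being one member of the $f$-divergence family). The function names `aligned_chunk_logprobs` and `binarised_kl` are hypothetical, and outcome chunk debiasing and hidden-state distillation are left out.

```python
import math


def token_end_offsets(token_lengths):
    """Cumulative end offset (in characters or bytes) of each token."""
    ends, total = [], 0
    for n in token_lengths:
        total += n
        ends.append(total)
    return ends


def aligned_chunk_logprobs(lengths_a, logps_a, lengths_b, logps_b):
    """Sum token log-probs between offsets where both tokenizations end a token.

    Assumption: token lengths are positive, so end offsets strictly increase.
    Tokens after the last shared boundary are dropped.
    """
    ends_a = token_end_offsets(lengths_a)
    ends_b = token_end_offsets(lengths_b)
    shared = sorted(set(ends_a) & set(ends_b))

    def sum_per_chunk(ends, logps):
        sums, acc, k = [], 0.0, 0
        for end, lp in zip(ends, logps):
            acc += lp
            if k < len(shared) and end == shared[k]:
                sums.append(acc)  # chunk closes at a shared boundary
                acc, k = 0.0, k + 1
        return sums

    return sum_per_chunk(ends_a, logps_a), sum_per_chunk(ends_b, logps_b)


def binarised_kl(teacher_chunk_logps, student_chunk_logps, eps=1e-9):
    """Mean KL(Bernoulli(p_teacher) || Bernoulli(p_student)) over chunks.

    Assumption: KL stands in for the paper's f-divergence; probabilities are
    clamped to (eps, 1 - eps) for numerical safety.
    """
    total = 0.0
    for lt, ls in zip(teacher_chunk_logps, student_chunk_logps):
        p = min(max(math.exp(lt), eps), 1.0 - eps)
        q = min(max(math.exp(ls), eps), 1.0 - eps)
        total += p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total / max(len(teacher_chunk_logps), 1)
```

For example, if the teacher splits a 12-character word into tokens of lengths 2, 6 and 4 and the student into lengths 3, 5 and 4, the shared boundaries are at offsets 8 and 12, giving two chunks on which the two models' likelihoods can be compared even though no individual tokens line up.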
Abstract
Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods require similar tokenizers between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a principled cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable effective distillation across fundamentally different tokenizers, while also substantially outperforming prior methods in all other cases. We verify the efficacy of our method on three distinct use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers, including rapid transfer of subword models to the byte-level. Transferring different models to the same tokenizer also enables ensembling to boost performance. Secondly, we distil a large maths-specialised LLM into a small general-purpose model with a different tokenizer, achieving competitive maths problem-solving performance. Thirdly, we use our method to train state-of-the-art embedding prediction hypernetworks for training-free tokenizer transfer. Our results unlock an expanded range of teacher-student pairs for distillation, enabling new ways to adapt and enhance interaction between LLMs.
