Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping

Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou

TL;DR

This work addresses cross-tokenizer knowledge distillation (CTKD), where heterogeneous tokenizers cause sequence misalignment and vocabulary mismatch between teacher and student models. It introduces Contextual Dynamic Mapping (CDM), a two-component framework that combines entropy-weighted Dynamic Time Warping (DTW) for sequence alignment with context-aware dynamic vocabulary mapping to align logits across tokenizers. Across five open-source model families and multiple task types, CDM consistently outperforms existing cross-tokenizer baselines. It yields additional gains when combined with same-tokenizer KD, and dual-teacher setups improve results further. The approach makes knowledge transfer between models with different tokenizers more effective, which supports model compression and deployment when teacher and student do not share a tokenizer.

Abstract

Knowledge Distillation (KD) has emerged as a prominent technique for model compression. However, conventional KD approaches primarily focus on homogeneous architectures with identical tokenizers, constraining their applicability in cross-architecture scenarios. In cross-tokenizer KD, differences between the tokenizers give rise to two fundamental challenges: (1) sequence misalignment caused by divergent tokenization strategies, and (2) mismatched vocabulary size and composition. While existing probability-matching methods attempt to address these issues, their efficacy remains limited due to suboptimal alignment in both the sequence and vocabulary aspects. To overcome these limitations, we propose Contextual Dynamic Mapping (CDM), a novel cross-tokenizer distillation framework that employs contextual information to enhance sequence alignment precision and dynamically improve vocabulary mapping. We evaluated the effectiveness of our approach across five advanced and widely used model families (i.e., Llama3, Phi3, Gemma2, OPT and Qwen2), which were configured into three distinct teacher-student pairs. Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks, including instruction-following, code generation and math. Notably, our analysis reveals that combining conventional same-tokenizer distillation and cross-tokenizer distillation through CDM yields further performance improvements. The code is available at https://github.com/pppa2019/ContexualDynamicMapping
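
To make the sequence misalignment challenge concrete, the following is a minimal sketch of plain Dynamic Time Warping over two tokenizations of the same text. It is not the authors' implementation: the function names `edit_distance` and `dtw_align` are hypothetical, the per-pair cost is assumed to be the Levenshtein distance between token strings, and the entropy weighting that CDM applies is omitted because this section does not specify it.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two token strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


def dtw_align(student: List[str], teacher: List[str]) -> List[Tuple[int, int]]:
    """Return a monotone alignment path of (student_idx, teacher_idx) pairs.

    Assumption: the local cost is the plain edit distance between tokens.
    """
    n, m = len(student), len(teacher)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i student tokens
    # with the first j teacher tokens.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = edit_distance(student[i - 1], teacher[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the last cell, preferring the diagonal step on ties.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path
```

Calling `dtw_align(["hello", "world"], ["hel", "lo", "world"])` returns a path in which one student token may pair with several teacher tokens, which is the one-to-many situation that a cross-tokenizer distillation loss has to handle before teacher and student outputs can be compared position by position.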

Paper Structure

This paper contains 29 sections, 7 equations, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of the alignment process in cross-tokenizer knowledge distillation. A and B denote the two tokenizers, each belonging to either the student or the teacher model.
  • Figure 2: Matching rate of sequence alignment results.
  • Figure 3: Matching rate of vocabulary alignment results.
  • Figure 4: The architecture of CDM consists of two key components: an entropy-weighted Dynamic Time Warping (DTW) sequence alignment algorithm and a dynamic Top-K vocabulary mapping algorithm. Following the mapping procedure, the output representations from both the teacher and student models are aligned to ensure consistency in both dimensional structure and semantic space.
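
The Top-K vocabulary mapping named in the Figure 4 caption can be pictured with the rough sketch below. It is an illustration under stated assumptions, not the paper's algorithm: the helper names `normalize` and `map_top_k` are hypothetical, tokens are matched across vocabularies by exact string equality after normalizing the word-start markers used by SentencePiece ("▁") and byte-level BPE ("Ġ"), and probability mass on unmatched tokens is simply dropped before renormalizing. The context-aware, dynamic part of CDM is not reproduced here.

```python
from typing import Dict

# Word-start markers used by SentencePiece and byte-level BPE tokenizers.
PREFIX_MARKERS = ("▁", "Ġ")


def normalize(token: str) -> str:
    """Replace a tokenizer-specific word-start marker with a plain space."""
    for marker in PREFIX_MARKERS:
        if token.startswith(marker):
            return " " + token[len(marker):]
    return token


def map_top_k(
    teacher_probs: Dict[str, float],
    student_vocab: Dict[str, int],
    k: int,
) -> Dict[int, float]:
    """Project the teacher's top-k tokens onto student token ids.

    Assumption: a teacher token maps to a student token only when their
    normalized strings are identical; everything else is discarded.
    """
    student_lookup = {normalize(tok): idx for tok, idx in student_vocab.items()}
    top_k = sorted(teacher_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

    mapped: Dict[int, float] = {}
    for token, prob in top_k:
        idx = student_lookup.get(normalize(token))
        if idx is not None:
            mapped[idx] = mapped.get(idx, 0.0) + prob

    total = sum(mapped.values())
    if total == 0.0:
        return {}  # none of the top-k teacher tokens exist in the student vocabulary
    # Renormalize so the mapped probabilities sum to 1.
    return {idx: p / total for idx, p in mapped.items()}
```

Restricting the mapping to the top-k teacher tokens at each position keeps the comparison small and recomputes the matched subset per position, whereas a static mapping over the full vocabularies would be built once and reused regardless of what the teacher actually predicts.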