Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong

TL;DR

Vocabulary mismatch between teacher and student LLMs limits cross-vocabulary distillation. VocAgnoLM introduces Token-level Lexical Alignment, which maps each student token to a range of teacher tokens, and a Teacher Guided Loss, which reweights student tokens using the mapped teacher losses, so the teacher can guide training without a compatible vocabulary. In continual pretraining on OpenWebMath with a TinyLlama 1.1B student and 7B-scale math teachers, VocAgnoLM achieves substantial gains that scale with teacher strength, and it outperforms both KLD and ULD baselines under vocabulary divergence. The results highlight the importance of fine-grained sequence alignment, unmapped-token handling, and multi-mapped token aggregation, and offer a practical path to using diverse, domain-specific teachers for vocabulary-agnostic pretraining on mathematical reasoning tasks.

Abstract

Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
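
To make the two components concrete, the following is a minimal sketch, not the authors' implementation. It assumes both tokenizations concatenate back to the same text, so tokens can be aligned by character offsets. The function names (`char_spans`, `align`, `mapped_teacher_loss`, `guided_loss`), the `max` aggregation over multi-mapped teacher tokens, and the excess-loss weighting rule are all hypothetical choices made for illustration; the summary above only states that student tokens are mapped to ranges of teacher tokens and reweighted using the mapped teacher losses.

```python
def char_spans(tokens):
    """Return (start, end) character offsets, assuming the tokens concatenate to the text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align(student_tokens, teacher_tokens):
    """Map each student token to the inclusive range of teacher token indices
    whose character spans overlap it; None marks an unmapped student token."""
    t_spans = char_spans(teacher_tokens)
    mapping, j = [], 0
    for s_start, s_end in char_spans(student_tokens):
        # Skip teacher tokens that end before this student token starts.
        while j < len(t_spans) and t_spans[j][1] <= s_start:
            j += 1
        # Collect teacher tokens that start before this student token ends.
        k = j
        while k < len(t_spans) and t_spans[k][0] < s_end:
            k += 1
        mapping.append((j, k - 1) if k > j else None)
    return mapping


def mapped_teacher_loss(mapping, teacher_losses, aggregate=max):
    """Aggregate teacher losses over each mapped range (assumption: max)."""
    return [
        None if rng is None else aggregate(teacher_losses[rng[0]:rng[1] + 1])
        for rng in mapping
    ]


def guided_loss(student_losses, mapped, unmapped_weight=1.0):
    """Hypothetical reweighting: weight each student token by how much its loss
    exceeds the mapped teacher loss (clipped at 0); unmapped tokens keep a fixed weight."""
    weights = [
        unmapped_weight if m is None else max(s - m, 0.0)
        for s, m in zip(student_losses, mapped)
    ]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, student_losses)) / total
```

For example, `align(["deriv", "ative"], ["de", "riv", "ative"])` maps the first student token to teacher tokens 0 through 1 and the second to teacher token 2. If the teacher sequence is truncated, the uncovered student tokens come back as `None`, which is where an unmapped-token policy would apply.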

Paper Structure

This paper contains 39 sections, 7 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Limitation in Utilizing Better LLMs as Teacher Models due to Vocabulary Mismatch: Qwen2.5-Math (Yang et al., 2024) outperforms Llemma (Azerbayev et al., 2024) on the math evaluation suite, but shares only 6.32% of its vocabulary with the student model, TinyLlama (Zhang et al., 2024).
  • Figure 2: Overview of Vocabulary-agnostic Teacher Guided Language Modeling. Left: Teacher models (such as Qwen, Mistral, DeepSeek) produce token sequences that differ from those of the student model (TinyLlama), leading to misalignment. Middle: To address this, Token-level Lexical Mapping establishes a one-to-many mapping from each student token to corresponding teacher tokens. Right: To overcome logit distribution divergence, the mapped teacher token loss is utilized to guide the training of the student model.
  • Figure 3: Comparison of Sequence Overlap by Granularity. Sequence overlap between the corresponding chunks of student (TinyLlama) and teacher models differs significantly across varying levels of granularity (Number of Chunks). IoU (Intersection over Union) refers to the overlap ratio between the two sequences, while IoS (Intersection over Student sequence) denotes the coverage of the student sequence by the teacher sequence.
  • Figure 4: Performance Comparison Across Various Teacher Models. VocAgnoLM consistently outperforms logit distribution-based baselines.
  • Figure 5: Comparison of Performance Improvements Across Different Teachers. VocAgnoLM effectively mitigates vocabulary mismatch and leverages higher-performing teacher models to achieve significant performance gains, outperforming logit distribution-based baselines.
  • ...and 3 more figures
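
The two overlap measures named in the Figure 3 caption can be illustrated with a short sketch. It assumes each chunk is represented as a half-open character interval `(start, end)` over the same text; the caption does not state the unit of measurement, so the interval representation and the function name `overlap_metrics` are assumptions for illustration only.

```python
def overlap_metrics(student_span, teacher_span):
    """Return (IoU, IoS) for two half-open intervals (start, end).

    IoU = |intersection| / |union|
    IoS = |intersection| / |student span|
    """
    s0, s1 = student_span
    t0, t1 = teacher_span
    inter = max(0, min(s1, t1) - max(s0, t0))
    union = (s1 - s0) + (t1 - t0) - inter
    iou = inter / union if union else 0.0
    ios = inter / (s1 - s0) if s1 > s0 else 0.0
    return iou, ios
```

With a student chunk covering characters 0 to 10 and a teacher chunk covering 5 to 20, the intersection is 5 characters and the union is 20, giving an IoU of 0.25 and an IoS of 0.5. IoS ignores teacher content that falls outside the student chunk, which is why the two measures can diverge as chunk granularity changes.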