Zero-Shot Tokenizer Transfer

Benjamin Minixhofer; Edoardo Maria Ponti; Ivan Vulić

TL;DR

This work introduces Zero-Shot Tokenizer Transfer (ZeTT), a framework that detaches language models from their fixed tokenizers by training a hypernetwork to predict embeddings for arbitrary new tokenizers without any training on the new tokenizer. The hypernetwork is trained on a diverse distribution of tokenizers, which enables zero-shot transfer as well as rapid adaptation with limited continued training. It applies to both encoder and decoder LMs, and a hypernetwork trained for a base model transfers to fine-tuned variants of that model. Empirically, the new tokenizers yield shorter token sequences while accuracy stays close to the original models on cross-lingual and coding benchmarks, with substantial improvements over heuristic baselines. The approach lays groundwork for flexible, tokenizer-agnostic LM deployment and for reusing adapters across tokenizer changes, although training the base hypernetwork carries a notable computational cost.

Abstract

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
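
To make the core idea concrete, the following is a minimal sketch of how embeddings for a new vocabulary could be predicted from the original ones: each new token is segmented with the original tokenizer, and the resulting piece embeddings are composed into one vector. This is a toy illustration, not the paper's method: the vocabulary, the greedy segmenter `orig_tokenize`, and the mean-pool-plus-linear-map `toy_hypernetwork` are all hypothetical stand-ins, whereas the hypernetwork described in the paper is a learned language model.

```python
import random

# Toy original vocabulary and embeddings (hypothetical values, not from the paper).
DIM = 4
random.seed(0)
orig_vocab = ["un", "believ", "able", "token", "izer", "s"]
orig_emb = {tok: [random.gauss(0.0, 1.0) for _ in range(DIM)] for tok in orig_vocab}


def orig_tokenize(text):
    """Greedy longest-match segmentation under the original vocabulary (toy stand-in)."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in orig_emb:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {text!r} at position {i}")
    return pieces


def toy_hypernetwork(new_token, weight):
    """Predict an embedding for new_token by composing original-piece embeddings.

    Mean-pools the piece embeddings, then applies a DIM x DIM linear map.
    The pooling-plus-linear form is an illustrative assumption only.
    """
    pieces = orig_tokenize(new_token)
    pooled = [sum(orig_emb[p][d] for p in pieces) / len(pieces) for d in range(DIM)]
    return [sum(weight[r][c] * pooled[c] for c in range(DIM)) for r in range(DIM)]


# An identity map stands in for trained hypernetwork parameters.
identity = [[1.0 if r == c else 0.0 for c in range(DIM)] for r in range(DIM)]

# A "new tokenizer" whose vocabulary contains longer units than the original one.
new_vocab = ["unbelievable", "tokenizers", "able"]
new_emb = {tok: toy_hypernetwork(tok, identity) for tok in new_vocab}
```

In this sketch, a token shared by both vocabularies (here `"able"`) keeps its original embedding under the identity map, while a longer token such as `"unbelievable"` receives the average of its three original pieces. A trained hypernetwork would replace the fixed pooling and identity map with learned parameters.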

Paper Structure (23 sections, 5 equations, 6 figures, 14 tables, 1 algorithm)

Introduction
Background
Methodology
Hypernetwork Training
Hypernetwork Architecture
Experiments
Setup
Zero-Shot and n-shot Results
Applying a Hypernetwork trained for a Base Model to Fine-Tuned Models
Discussion
Conclusion
Limitations
Unigramifying: Approximating Arbitrary Tokenizers via UnigramLM
Stabilization Effect of the Auxiliary Loss
Non-Amortizing Hypernetworks
...and 8 more sections

Figures (6)

Figure 1: The hypernetwork predicts input and output embeddings based on the tokenizer.
Figure 2: The hypernetwork consists of a language model $\mathrm{HLM}_\theta$ learning to compose embeddings under the original tokenization into a new embedding and amortizes over the tokenization function.
Figure 3: Language modeling loss of GPT2, and GPT2 with untied weight embeddings with and without the auxiliary loss across the first 50k training steps, excluding MIMICK-style warmup.
Figure 4: Difference in accuracy to the original XLM-R model on XNLI of our method and FOCUS across vocabularies with size 30k, 50k, and 100k of the new tokenizer.
Figure 5: Correlation of the difference in accuracy to the original XLM-R model with Unigram overlap probability $p(\text{overlap})$ (left) and vocabulary overlap (right).
...and 1 more figure
