Vocabulary Customization for Efficient Domain-Specific LLM Deployment

Christian Herold, Michael Kozielski, Nicholas Santavas, Yannick Versley, Shahram Khadivi

TL;DR

This work tackles tokenization inefficiency when applying autoregressive LLMs to domain-specific text. It introduces a deterministic vocabulary-extension algorithm that adds domain tokens while guaranteeing that no input is segmented into more tokens than before. The approach augments the Llama-3.1 tokenizer with domain-focused merges, initializes new embeddings by averaging related vectors, and fine-tunes the extended model, achieving up to 20% shorter inputs and 20–30% throughput gains while preserving overall quality. Experiments on a multilingual, production-grade e-commerce task suite show that the model widely adopts the new tokens (≈98% usage on longer inputs) and that the extension is orthogonal to other efficiency techniques such as quantization. The method provides a practical, backward-compatible path to faster, domain-tuned LLM deployment, with potential applicability to other domains and model families.
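
The embedding initialization mentioned above can be sketched as follows. This is a minimal illustration, assuming that the "related vectors" are the embeddings of the sub-tokens into which the original tokenizer splits each new token; the summary does not spell this out. The names `init_new_embedding`, `old_tokenize`, `toy_embeddings`, and `toy_split`, as well as all numeric values, are hypothetical.

```python
from typing import Callable, Dict, List


def init_new_embedding(
    new_token: str,
    old_tokenize: Callable[[str], List[str]],
    embeddings: Dict[str, List[float]],
) -> List[float]:
    """Average the embeddings of the old sub-tokens that spell `new_token`.

    Assumption: "related vectors" means the original tokenizer's pieces.
    """
    pieces = old_tokenize(new_token)
    dim = len(embeddings[pieces[0]])
    # Component-wise mean over all pieces.
    return [sum(embeddings[p][i] for p in pieces) / len(pieces) for i in range(dim)]


# Hypothetical toy data: a 2-dimensional embedding table and a fixed split.
toy_embeddings = {"ship": [1.0, 4.0], "ping": [3.0, 2.0]}
toy_split = {"shipping": ["ship", "ping"]}

new_vector = init_new_embedding("shipping", toy_split.__getitem__, toy_embeddings)
print(new_vector)  # [2.0, 3.0]
```

Starting a new token at the mean of its former pieces places it near representations the model already knows, which is presumably why fine-tuning can then adapt it quickly.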

Abstract

When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch: the general-domain tokenizer fails to capture frequent domain-specific terms. The resulting suboptimal sub-word splits increase token fertility and thus decrease processing speed. We address this limitation by augmenting the pretrained vocabulary with a set of domain-specific tokens. To this end, we design an algorithm that extends an existing tokenizer while guaranteeing it never decreases tokenization efficiency: every input sequence is segmented into at most the same number of tokens as before. Evaluated on real-world e-commerce use cases, the augmented tokenizer shortens input sequences by up to 20% and reduces inference latency on downstream tasks while preserving predictive quality. We further analyze secondary effects, such as the impact on forward pass speed and the rate at which the model adopts the newly introduced tokens, to illustrate the broader benefits of vocabulary adaptation.
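
One way such a "never more tokens" guarantee can arise in a BPE tokenizer is sketched below. It is a toy illustration, not necessarily the paper's algorithm: it assumes that new merges are appended after all original merges, so they have lower priority and are applied only once the original segmentation has been reached. The function `bpe_encode` and the merge lists are hypothetical.

```python
from typing import List, Tuple

Merge = Tuple[str, str]


def bpe_encode(text: str, merges: List[Merge]) -> List[str]:
    """Toy BPE: repeatedly apply the highest-priority merge that is present."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    tokens = list(text)
    while len(tokens) > 1:
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best rank first, leftmost occurrence on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Hypothetical merge tables: a tiny "general" tokenizer plus domain merges.
base_merges = [("i", "n"), ("in", "g"), ("s", "h"), ("i", "p")]
domain_merges = [("sh", "ip"), ("p", "ing"), ("ship", "ping")]
extended_merges = base_merges + domain_merges  # appended => lower priority

for word in ["shipping", "sing", "hip"]:
    old = bpe_encode(word, base_merges)
    new = bpe_encode(word, extended_merges)
    assert len(new) <= len(old)  # the extension never lengthens an input
    print(word, old, new)
```

Because every appended merge pairs tokens that the original merges leave untouched, the extended tokenizer first reproduces the original segmentation and can only shorten it from there.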

Paper Structure

This paper contains 18 sections, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: We extend the Llama 3.1 tokenizer with new vocabulary entries and merge operations that are specific to the e-commerce domain. The result is a much more efficient tokenization of e-commerce-specific phrases, which significantly reduces the cost of running such models in production.
  • Figure 2: Impact of extending the Llama-3.1 tokenizer with e-commerce-specific tokens. Shown is the average number of tokens needed to encode a document versus the number of new tokens added to the tokenizer. We compare our algorithm for tokenizer extension against Yamaguchi et al. (2024). Impact on tokenization of (a) Wikipedia articles; (b) downstream e-commerce tasks.
  • Figure 3: Impact of adding tokens to the Llama-3.1 8B model (and therefore increasing the size of the embedding and projection matrices) on the speed of a single forward pass. The model is deployed via vLLM on a single H100 GPU.
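
The growth of the embedding and projection matrices mentioned in Figure 3 can be estimated with simple arithmetic. The sketch below assumes a hidden size of 4096 and untied input and output matrices, as in the public Llama-3.1 8B configuration. The count of 10,000 new tokens is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def added_parameters(n_new_tokens: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters added by new vocabulary rows.

    Each new token adds one row to the input embedding matrix and, when the
    output projection is not tied to it, one row to that matrix as well.
    """
    matrices = 1 if tied else 2
    return matrices * n_new_tokens * hidden_size


HIDDEN_SIZE = 4096  # assumed: Llama-3.1 8B hidden dimension
N_NEW = 10_000      # hypothetical number of added tokens

extra = added_parameters(N_NEW, HIDDEN_SIZE)
print(extra)  # 81920000
```

Under these assumptions the extension adds roughly 82 million parameters, about 1% of an 8-billion-parameter model, which gives a sense of scale for the forward-pass measurements.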