KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito

TL;DR

KL3M addresses the inefficiency and semantic fragmentation of general-purpose tokenizers on professional text by introducing two tokenizer families: domain-specific BPE tokenizers (64K and 128K vocabularies) and character-level BPE tokenizers (4K, 8K, and 16K vocabularies), trained on a copyright-free legal, financial, and governmental corpus. The domain-specific tokenizers represent the same text and the same domain terms in fewer tokens, which fits more text into a context window and lowers compute while preserving domain semantics such as legal citations and financial abbreviations. Evaluation across five domain datasets shows an average improvement of about 9% in tokens per character and fewer tokens per domain term, with the strongest results on US Code and SEC filings. All tokenizers and training code are available on GitHub and Hugging Face for reproduction and for use in legal, financial, and OCR-driven workflows.

Abstract

We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
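
The efficiency figures above are expressed as tokens per character and as a percentage of tokens saved relative to a baseline. The following is a minimal sketch of how such numbers could be computed; it is not the authors' evaluation code. The function names are hypothetical, and the two stand-in tokenizers (whitespace and per-character splitting) are assumptions used only so the example runs without downloading any model.

```python
from typing import Callable, Iterable, List


def tokens_per_character(
    tokenize: Callable[[str], List[str]], texts: Iterable[str]
) -> float:
    """Total tokens divided by total characters; lower means more efficient."""
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("no characters to measure")
    return total_tokens / total_chars


def percent_fewer_tokens(candidate_count: int, baseline_count: int) -> float:
    """Percentage of tokens saved by the candidate relative to the baseline."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (baseline_count - candidate_count) / baseline_count


# Stand-in tokenizers (assumptions for illustration, not KL3M tokenizers).
def whitespace_tokenize(text: str) -> List[str]:
    return text.split()


def char_tokenize(text: str) -> List[str]:
    return list(text)


if __name__ == "__main__":
    sample = ["The court shall review the filing."]
    print(tokens_per_character(whitespace_tokenize, sample))
    print(tokens_per_character(char_tokenize, sample))
    print(percent_fewer_tokens(candidate_count=83, baseline_count=100))
```

To compare real tokenizers, any callable that maps a string to a list of tokens can be passed in place of the stand-ins.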

Paper Structure

This paper contains 40 sections, 1 equation, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Relationship between vocabulary size and tokenization efficiency. KL3M tokenizers achieve superior efficiency despite having comparable vocabulary sizes to other tokenizers.
  • Figure 2: Percentage of vocabulary by token length across tokenizers. KL3M tokenizers show higher percentages of medium-length tokens (3-6 characters) compared to other tokenizers.
  • Figure 3: Tokenization efficiency (tokens per character) across datasets. Lower values indicate higher efficiency. KL3M tokenizers consistently demonstrate higher efficiency, particularly for domain-specific content like US Code and Congressional Hearings.
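
The quantity plotted in Figure 2, the percentage of vocabulary entries at each token length, can be sketched as follows. This is an illustrative computation under stated assumptions, not the code used to produce the figure: `toy_vocab` is a hypothetical eight-entry vocabulary, and the helper names are invented for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def length_distribution(vocab: Iterable[str]) -> Dict[int, float]:
    """Map each token length (in characters) to its percentage of the vocabulary."""
    tokens = list(vocab)
    if not tokens:
        raise ValueError("vocabulary is empty")
    counts = Counter(len(token) for token in tokens)
    return {
        length: 100.0 * count / len(tokens)
        for length, count in sorted(counts.items())
    }


def share_in_range(dist: Dict[int, float], low: int, high: int) -> float:
    """Sum the percentages for token lengths in the inclusive range [low, high]."""
    return sum(pct for length, pct in dist.items() if low <= length <= high)


# Hypothetical vocabulary, for illustration only.
toy_vocab = ["a", "of", "the", "sec", "shall", "court", "U.S.C.", "EBITDA"]

if __name__ == "__main__":
    dist = length_distribution(toy_vocab)
    print(dist)
    print(share_in_range(dist, 3, 6))  # share of medium-length tokens
```

For a real tokenizer, the vocabulary strings would first need to be decoded from their byte-level or marker-prefixed form so that lengths are counted in characters.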