MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri

TL;DR

This work tackles the suboptimal morpheme alignment of Byte Pair Encoding (BPE) in morphologically diverse languages by introducing MorphBPE, a morphology-aware extension that prevents merges across morpheme boundaries while remaining compatible with existing LLM pipelines. It also introduces linguistically grounded intrinsic metrics, namely the Morph.-Edit Distance Score ($\mu_e$) and Morph.-Consistency F1-Score ($\mu_c$), to evaluate tokenization quality. Empirical studies on English, Russian, Hungarian, and Arabic with 300M and 1B parameter models show that MorphBPE lowers cross-entropy loss and accelerates convergence, with stronger improvements in morphologically complex languages. The method yields more interpretable tokenizations and can be integrated with minimal changes to current training setups, accompanied by public tooling and a focus on multilingual morphologies.
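
The core constraint, that merges never cross a morpheme boundary, can be illustrated with a small sketch. This is not the authors' implementation: it assumes words arrive already segmented into morphemes, and the names `train_morph_bpe` and `merge_pair` are hypothetical. It is plain BPE training in which symbol pairs are counted only inside a morpheme.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_morph_bpe(segmented_words, num_merges):
    """Learn up to `num_merges` BPE merges that never span a morpheme boundary.

    segmented_words maps a tuple of morphemes to a corpus count,
    e.g. {("un", "do"): 5}. This input format is an assumption of the sketch.
    """
    # Each word is a tuple of morphemes; each morpheme starts as a tuple of characters.
    corpus = {tuple(tuple(m) for m in word): count
              for word, count in segmented_words.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, count in corpus.items():
            for morpheme in word:
                # Pairs are counted only within a morpheme, so a pair that
                # straddles a boundary can never become a merge candidate.
                for a, b in zip(morpheme, morpheme[1:]):
                    pair_counts[(a, b)] += count
        if not pair_counts:
            break  # every morpheme is already a single token
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = {}
        for word, count in corpus.items():
            new_word = tuple(merge_pair(m, best) for m in word)
            new_corpus[new_word] = new_corpus.get(new_word, 0) + count
        corpus = new_corpus
    return merges, corpus


toy = {("un", "do"): 5, ("un", "tie"): 5, ("do", "ing"): 2}
merges, corpus = train_morph_bpe(toy, num_merges=20)
```

On this toy input the pair `("n", "d")` in un|do is never counted, so no token such as `und` can be learned, whereas vanilla BPE would be free to merge it if it were frequent enough.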

Abstract

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, contributing to LLM training convergence, and (ii) Morphological Edit Distance, which measures alignment between morphemes and tokens concerning interpretability. Experiments on English, Russian, Hungarian, and Arabic across 300M and 1B parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: https://github.com/llm-lab-org/MorphBPE and https://tokenizer.llm-lab.org
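
The abstract describes the Morphological Consistency F1-Score only at a high level, so the following is a sketch under an assumed reading rather than the paper's exact definition. For every pair of words, "shares at least one morpheme" is treated as the gold label and "shares at least one token" as the prediction, and F1 is computed over those pairs. The function name `morph_consistency_f1` and the dictionary-based inputs are hypothetical.

```python
from itertools import combinations


def morph_consistency_f1(morphemes, tokens):
    """Pairwise F1 between morpheme sharing (gold) and token sharing (predicted).

    morphemes and tokens map each word to its list of morphemes / tokens.
    Assumed formulation for illustration only.
    """
    tp = fp = fn = 0
    for w1, w2 in combinations(sorted(morphemes), 2):
        share_morph = bool(set(morphemes[w1]) & set(morphemes[w2]))
        share_token = bool(set(tokens[w1]) & set(tokens[w2]))
        if share_morph and share_token:
            tp += 1  # words related by morphology also share a token
        elif share_token:
            fp += 1  # shared token without a shared morpheme
        elif share_morph:
            fn += 1  # shared morpheme that the tokenizer split differently
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = {"undo": ["un", "do"], "untie": ["un", "tie"], "doing": ["do", "ing"]}
aligned = {"undo": ["un", "do"], "untie": ["un", "tie"], "doing": ["do", "ing"]}
misaligned = {"undo": ["und", "o"], "untie": ["unt", "ie"], "doing": ["do", "ing"]}

score_aligned = morph_consistency_f1(gold, aligned)
score_misaligned = morph_consistency_f1(gold, misaligned)
```

Under this reading, a tokenizer whose tokens coincide with morphemes scores highest, while one that splits the shared prefix `un` differently in each word loses recall.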

Paper Structure

This paper contains 15 sections, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of morphological distance and fertility rate for BPE and MorphBPE across four languages.
  • Figure 2: Overview of the MorphBPE study: We evaluate the effectiveness of MorphBPE over vanilla BPE across four morphologically diverse languages (English, Russian, Hungarian, and Arabic) by aligning vocabulary size with morphological segmentation. We then evaluate the tokenizers using the intrinsic evaluation metrics.
  • Figure 3: Comparison of training cross-entropy loss between BPE and MorphBPE across four languages. Results are shown for both the small (300M) and large (1B) models.