Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay

Duygu Altinok

TL;DR

This work addresses tokenization in Turkish, a morphologically rich language, by introducing a subwords manifest that systematically couples vocabulary size with tokenizer training data. It evaluates multiple tokenizer families (WordPiece, morphology-aware subwords, and character baselines) across semantic, syntactic, and morphology-sensitive tasks, enriched by a morphology-aware diagnostic toolkit (boundary F1, lemma integrity, CER/WER, affix coverage). The study demonstrates that mid–large WordPiece vocabularies (approximately 32k–52k) trained on mixed-domain data offer the most reliable accuracy–efficiency trade-offs, while morphology-aware subwords provide strong gains on morph-sensitive tasks and interpretable explanations. The findings guide practical tokenizer design for Turkish NLP and establish a reproducible framework for evaluating tokenizers in morphologically rich languages, with open-source tooling and datasets to facilitate adoption and further research.

Abstract

Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization, a "subwords manifest" that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology-level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER), syntactic (POS, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to boundary-level micro/macro F1, decoupled lemma atomicity vs. surface boundary hits, over/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, affix-type coverage, and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this "subwords manifest" delivers actionable guidance for building effective tokenizers in MRLs and establishes a reproducible foundation for future research.
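To make the boundary-level micro/macro F1 diagnostic concrete, the following is a minimal sketch of one plausible way to compute it. It is not the paper's released evaluation code: the function names, the convention of scoring a word with no boundaries as 0, and the toy gold segmentations are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def boundaries(pieces: List[str]) -> Set[int]:
    """Character offsets of the internal boundaries implied by a segmentation."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:  # the end of the last piece is the word end, not a boundary
        pos += len(piece)
        offsets.add(pos)
    return offsets


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; returns 0.0 when undefined (an assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def boundary_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) over aligned gold and predicted segmentations."""
    tp_sum = fp_sum = fn_sum = 0
    per_word = []
    for gold_pieces, pred_pieces in zip(gold, pred):
        g, p = boundaries(gold_pieces), boundaries(pred_pieces)
        tp, fp, fn = len(g & p), len(p - g), len(g - p)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
        per_word.append(f1(tp, fp, fn))
    micro = f1(tp_sum, fp_sum, fn_sum)  # pool counts, then score
    macro = sum(per_word) / len(per_word) if per_word else 0.0  # score, then average
    return micro, macro


# Hypothetical toy data: "evlerimizde" (ev+ler+imiz+de), "kitaplar" (kitap+lar).
gold = [["ev", "ler", "imiz", "de"], ["kitap", "lar"]]
pred = [["evler", "imiz", "de"], ["kit", "aplar"]]
micro, macro = boundary_f1(gold, pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the first word recovers two of three gold boundaries and the second recovers none, so the pooled (micro) score and the per-word average (macro) differ, which is the reason both are worth reporting.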

Paper Structure

This paper contains 74 sections, 15 equations, 16 figures, and 15 tables.

Figures (16)

  • Figure 1: CoLA with word‑level vocabularies: coverage efficiency and success. Left: achieving high token coverage requires retaining large fractions of the word list on train, with a different accumulation pattern on test. Right: increasing train token coverage does not improve CoLA performance and in fact trends downward, pointing to representation limits rather than coverage as the bottleneck.
  • Figure 2: SST‑2 with word‑level vocabularies: coverage efficiency and success. Left: coverage accumulates differently on train vs. test, with test saturating earlier. Right: performance exhibits an early “elbow,” reaching 85% accuracy by 80% train coverage and showing no gains with larger vocabularies.
  • Figure 3: Word‑level vocabulary efficiency and downstream NER performance. Left: achieving high coverage requires large fractions of the vocabulary, and test coverage accumulates more slowly than train, underscoring inefficiency and domain mismatch. Right: raising training token coverage from 75% to 100% yields only modest and unstable gains, with F1 saturating around 0.5.
  • Figure 4: BOUN word‑level vocabulary: coverage efficiency and downstream success. Left: achieving higher coverage requires retaining large fractions of the word list, and train–test coverage accumulates differently, evidencing distribution shift. Right: increasing train token coverage beyond 75% does not translate into meaningful gains for POS, dependency (LAS), or morphology; performance plateaus at low levels, highlighting the inefficiency of word‑level vocabularies on this morphologically rich dataset.
  • Figure 5: Explainability heatmaps for CoLA vs. SST‑2 with word-level vocabularies. CoLA reveals weak, scattered cues consistent with poor grammatical acceptability performance, whereas SST‑2 focuses on a compact set of sentiment-bearing tokens, explaining the early performance gains and subsequent plateau once these cues are covered.
  • ...and 11 more figures
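
The left-hand panels in Figures 1 to 4 relate the fraction of the word list retained to token coverage on train and test. The exact procedure behind those plots is not given in this excerpt; the sketch below shows one straightforward way such a curve could be computed, assuming the vocabulary is built by keeping the most frequent training word types. The function name and the toy token lists are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def coverage_curve(
    train_tokens: Sequence[str],
    test_tokens: Sequence[str],
    fractions: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each retained fraction of train word types, return
    (fraction, train token coverage, test token coverage)."""
    # Rank word types by training frequency, most frequent first.
    ranked = [word for word, _ in Counter(train_tokens).most_common()]
    rows = []
    for frac in fractions:
        k = max(1, math.ceil(frac * len(ranked)))  # keep at least one type
        vocab = set(ranked[:k])
        train_cov = sum(t in vocab for t in train_tokens) / len(train_tokens)
        test_cov = sum(t in vocab for t in test_tokens) / len(test_tokens)
        rows.append((frac, train_cov, test_cov))
    return rows


# Hypothetical toy corpora; "kalem" never occurs in train, so test coverage
# cannot reach 100% no matter how much of the train word list is kept.
train = "ev ev ev ev kitap kitap okul masa".split()
test = "ev kitap kalem kalem".split()
curve = coverage_curve(train, test, [0.25, 0.5, 1.0])
for frac, train_cov, test_cov in curve:
    print(f"kept={frac:.2f} train={train_cov:.2f} test={test_cov:.2f}")
```

Because the vocabulary is selected on train only, unseen test word types cap test coverage, which is one simple way train and test curves can accumulate differently.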