Subword Tokenization Strategies for Kurdish Word Embeddings

Ali Salehi, Cassandra L. Jacobs

TL;DR

The paper investigates tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches, using a bootstrapped BiLSTM-CRF morphological segmentation model and Word2Vec embeddings. It shows that the apparent advantage of BPE in morphological similarity arises from severe evaluation coverage biases, while morpheme-based tokenization yields more coherent embedding-space organization and better semantic structure when assessed fairly. The work emphasizes coverage-aware evaluation in low-resource, morphologically rich languages and suggests that hybrid tokenization approaches may better capture the full spectrum of Kurdish morphology. Overall, the findings guide practical tokenization choices for Kurdish NLP and similar low-resource languages, with implications for downstream tasks and language-model development.

Abstract

We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparisons. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offer a comparison of tokenization methods for low-resource languages.
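
The coverage bias described in the abstract can be illustrated with a minimal sketch. It assumes that a test pair is evaluable only when both words have a vector in the embedding table; how the paper actually handles out-of-vocabulary items is not stated here. The function name `evaluate_with_coverage`, the placeholder tokens `w1` to `w5`, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def evaluate_with_coverage(embeddings, test_pairs):
    """Report mean similarity together with the share of pairs it is based on.

    Assumption: a pair is skipped when either word lacks a vector.
    """
    scores = [
        cosine(embeddings[a], embeddings[b])
        for a, b in test_pairs
        if a in embeddings and b in embeddings
    ]
    coverage = len(scores) / len(test_pairs) if test_pairs else 0.0
    mean_similarity = sum(scores) / len(scores) if scores else float("nan")
    return {
        "mean_similarity": mean_similarity,
        "coverage": coverage,
        "evaluated": len(scores),
        "total": len(test_pairs),
    }


# Hypothetical toy data: only w1, w2, and w3 have vectors.
toy_embeddings = {"w1": [1.0, 0.0], "w2": [1.0, 0.0], "w3": [0.0, 1.0]}
toy_pairs = [("w1", "w2"), ("w1", "w3"), ("w1", "w4"), ("w4", "w5")]

result = evaluate_with_coverage(toy_embeddings, toy_pairs)
print(result)
```

Reporting `coverage` next to `mean_similarity` makes it visible when a high average rests on a small, possibly easier, subset of the test set, which is the situation the abstract describes for BPE.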

Paper Structure

This paper contains 31 sections, 9 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: BiLSTM-CRF morphological segmentation accuracy by part-of-speech category, showing substantial variation in boundary detection performance across linguistic categories.
  • Figure 2: Average similarity by neighbor rank, showing how similarity decreases across ranked nearest neighbors for each tokenization approach. BPE maintains higher similarities across all ranks while morpheme and word models show steeper decay patterns.
  • Figure 3: Distribution of cosine distances for intra-lemma (same lemma) versus inter-lemma (different lemmas) word pairs across tokenization approaches. Blue histograms show distances between inflected forms of the same lemma, while orange histograms show distances between forms of different lemmas.
  • Figure 4: Similarity score distributions showing BPE's concentration in high-similarity ranges versus word and morpheme models' broader coverage.
  • Figure 5: Dataset coverage in embeddings, showing dramatic differences in evaluation coverage across tokenization approaches.
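
The intra-lemma versus inter-lemma comparison in the Figure 3 caption could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: it assumes every word form has a single vector and a known lemma, and it compares all unordered pairs. The names `lemma_distance_split`, `form_a1`, `lemma_a`, and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def lemma_distance_split(vectors, lemma_of):
    """Split pairwise distances into same-lemma and different-lemma groups."""
    intra, inter = [], []
    for a, b in combinations(sorted(vectors), 2):
        distance = cosine_distance(vectors[a], vectors[b])
        if lemma_of[a] == lemma_of[b]:
            intra.append(distance)  # inflected forms of one lemma
        else:
            inter.append(distance)  # forms of different lemmas
    return intra, inter


# Hypothetical toy data: two forms of lemma_a and one form of lemma_b.
vectors = {
    "form_a1": [1.0, 0.0],
    "form_a2": [1.0, 0.0],
    "form_b1": [0.0, 1.0],
}
lemma_of = {"form_a1": "lemma_a", "form_a2": "lemma_a", "form_b1": "lemma_b"}

intra, inter = lemma_distance_split(vectors, lemma_of)
print("intra-lemma distances:", intra)
print("inter-lemma distances:", inter)
```

The two lists can then be plotted as separate histograms. Well-organized embeddings would be expected to show smaller intra-lemma than inter-lemma distances.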