Morphological Typology in BPE Subword Productivity and Language Modeling

Iñigo Parra

TL;DR

The results suggest a correlation between morphological typology and BPE tokenization efficiency: languages with synthetic features exhibit greater subword regularity and productivity under BPE tokenization and achieve better results in language modeling tasks.

Abstract

This study investigates the impact of morphological typology on tokenization and language modeling performance. We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized using the byte-pair encoding (BPE) algorithm. We compare the performance of models trained with similar amounts of data in different languages. Our experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization and achieve better results in language modeling tasks. We also observe that the typological continuum from linguistic theory is reflected in several experiments. These findings suggest a correlation between morphological typology and BPE tokenization efficiency.

Paper Structure

This paper contains 15 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Example of the pipeline. The input consists of parallel and independent corpora. After BPE tokenization, we compare performance on language modeling and compute the subword productivity.
  • Figure 2: Trends of subword repetition. As the sample size increases, the lines form two groups that show distinct behaviors. Red lines represent synthetic languages; green lines represent analytic languages.
  • Figure 3: Productivity scores per language after averaging results for 300, 400, and 500 merge operations. Measurements were performed using the PBC parallel corpora. The error bars indicate the standard deviation between rounds of merge operations.
  • Figure 4: Frequency of the $n$-th most repeated subword. As $n$ increases, the subwords in synthetic languages show higher frequencies.
  • Figure 5: Results of training on independent corpora extracted from LCC (a) and validation on the PBC corpora (b). Overall, synthetic languages performed better than their analytic counterparts, as evidenced by lower and more consistent values.
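
The captions above refer to BPE merge operations (Figure 3) and to the frequency of the $n$-th most repeated subword (Figure 4). As a rough illustration of those two notions, the sketch below learns a fixed number of BPE merges from a word list and then counts how often each resulting subword occurs. It is not the authors' implementation: the function names `learn_bpe` and `subword_frequencies`, the `</w>` end-of-word marker, and the toy corpus are assumptions made for this example, and the paper's productivity score is not computed because its definition is not given in this section.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (illustrative)."""
    # Each word becomes a tuple of characters plus an end-of-word marker;
    # the Counter stores how many times each segmented word occurs.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair
        # with a single merged symbol.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges, vocab


def subword_frequencies(vocab):
    """Count how often each subword occurs in the segmented corpus."""
    freqs = Counter()
    for symbols, count in vocab.items():
        for symbol in symbols:
            freqs[symbol] += count
    return freqs


# Toy corpus (hypothetical); a real experiment would use the corpus text.
corpus = ["abab", "abab", "ab"]
merges, vocab = learn_bpe(corpus, num_merges=1)
freqs = subword_frequencies(vocab)
print(merges)
print(freqs.most_common(2))
```

Sorting `freqs` by count gives a ranking of the kind plotted in Figure 4, and rerunning `learn_bpe` with different `num_merges` values (for example 300, 400, and 500, as in Figure 3) shows how the subword inventory changes with the number of merges.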