The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Francois Meyer, Jan Buys

TL;DR

This work investigates how subword boundaries evolve when tokenisation is learned jointly with language modelling, across morphologically diverse languages. The authors extend SSLM to a Transformer-based framework (T-SSLM) that supports end-to-end pretraining and finetuning with a learnable subword segmentation, tracking segmentation changes via Viterbi decoding. They identify four distinct subword-learning stages and find that morphologically complex isiXhosa exhibits greater instability, while Setswana converges earlier; finetuning tends to yield finer-grained subwords and task-specific realignment, improving isiXhosa data-to-text generation and cross-lingual transfer. The results highlight the potential of learnable subword segmentation to enhance low-resource, morphologically rich language generation, while also emphasising computational costs and the need for broader linguistic coverage in future work.

Abstract

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improving text generation and cross-lingual transfer for low-resource, morphologically complex languages.
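
The TL;DR notes that segmentation changes are tracked via Viterbi decoding. As a rough illustration of that idea only, the sketch below finds the highest-scoring split of a word under a subword scoring function. The names `viterbi_segment`, `toy_log_prob`, and `TOY_SCORES` are hypothetical, and the scores are made up for the example; in the paper the scores would come from the trained model, whose interface is not described here.

```python
import math

def viterbi_segment(word, log_prob, max_len=5):
    """Return the highest-scoring segmentation of `word` into subwords.

    `log_prob(piece)` is an assumed scoring function (higher is better);
    `max_len` caps the subword length considered.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last subword in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + log_prob(word[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the word to recover the split.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical toy scores, NOT taken from the paper or any trained model.
TOY_SCORES = {"un": -1.0, "lock": -1.5, "ed": -1.0, "unlock": -4.0}

def toy_log_prob(piece):
    # Unlisted pieces receive a heavy per-character penalty.
    return TOY_SCORES.get(piece, -5.0 * len(piece))

print(viterbi_segment("unlocked", toy_log_prob, max_len=6))
```

With these toy scores the split `un` + `lock` + `ed` (total -3.5) beats `unlock` + `ed` (total -5.0). Re-running such a decoder at successive checkpoints is one way to observe how boundaries move during training.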

Paper Structure

This paper contains 30 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Subword fertility (average subwords per word) gradually plateaus for isiXhosa, while converging early for English and Setswana.
  • Figure 2: Morphological boundary overlap between learned subwords and morphological segmentations. Pretraining is performed on Setswana/English/isiXhosa, while finetuning is performed on isiXhosa data-to-text.
  • Figure 3: The average productivity and idiosyncrasy of learned subwords. Pretraining is performed on isiXhosa and Setswana, respectively, while finetuning is conducted on isiXhosa data-to-text generation.
  • Figure 4: Fertility distributions (subwords per segmented word) across the four training stages identified in this study. All languages exhibit increasing fertility as pretraining progresses, with a more pronounced shift during finetuning. Changes are especially dramatic for isiXhosa, whose complex morphology leads to larger distributional shifts than Setswana or English.
  • Figure 5: Boundary overlap between isiXhosa morphemes and subwords learned by our T-SSLMs pretrained on Setswana (left) and English (right). During isiXhosa finetuning, the models adjust their subword segmentation to align with the morphological boundaries of the new language.
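
Two of the quantities in these captions can be made concrete with a short sketch. Fertility is defined above as the average number of subwords per word. For boundary overlap, the exact metric is not given in this summary, so the code below assumes boundary precision, recall, and F1 against a reference morphological segmentation, which is one common choice. All function names and the example segmentations are hypothetical.

```python
def fertility(segmented_words):
    """Average number of subwords per word."""
    return sum(len(pieces) for pieces in segmented_words) / len(segmented_words)

def boundaries(pieces):
    """Character offsets of the internal boundaries of one segmented word."""
    positions, offset = set(), 0
    for piece in pieces[:-1]:
        offset += len(piece)
        positions.add(offset)
    return positions

def boundary_overlap(predicted, gold):
    """Boundary precision, recall, and F1 (an assumed overlap metric)."""
    tp = n_pred = n_gold = 0
    for pred_pieces, gold_pieces in zip(predicted, gold):
        p, g = boundaries(pred_pieces), boundaries(gold_pieces)
        tp += len(p & g)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up English segmentations, used only to exercise the functions.
predicted = [["unlock", "ed"], ["wal", "king"], ["cats"]]
gold = [["un", "lock", "ed"], ["walk", "ing"], ["cat", "s"]]

print(fertility(predicted))               # 5 subwords over 3 words
print(boundary_overlap(predicted, gold))  # (precision, recall, F1)
```

In this toy example the predicted segmentation has lower fertility than the reference (5/3 against 7/3) and recovers one of the four reference boundaries, giving precision 0.5 and recall 0.25.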