Contextual morphologically-guided tokenization for Latin encoder models

Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor

TL;DR

The paper investigates morphologically-guided tokenization for Latin to address the misalignment of standard tokenization with morphology in morphologically rich languages. It proposes three tokenization strategies that integrate morphological information (MorphSeeding and MorphPreTokenization, with contextual and acontextual variants) and evaluates eight Latin RoBERTa models on POS and morphological feature tagging, named entity recognition (NER), word sense disambiguation (WSD), and authorship verification (AV). The study shows that morphology-aware tokenizers, especially MorphPreTokenization, improve morphological feature tagging and NER and enhance out-of-domain generalization, although their effects on WSD and AV are task-dependent. The findings underscore the value of high-quality linguistic resources for low-resource or morphologically complex languages and suggest that morphology-aligned tokenization can meaningfully boost LM performance when raw data are limited. Overall, the work advocates for integrating linguistic knowledge into tokenization and resource development to advance NLP for Latin and similar languages.

Abstract

Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretic goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.

Paper Structure

This paper contains 56 sections, 1 figure, 12 tables.

Figures (1)

  • Figure 1: Word frequency in the pretraining corpus versus whole-string morphological accuracy, for ULM (top) and WordPiece (bottom).