Evaluating Shortest Edit Script Methods for Contextual Lemmatization
Olia Toporkov, Rodrigo Agerri
TL;DR
The paper examines how different Shortest Edit Script (SES) representations influence contextual lemmatization by casting lemmatization as token classification with SES labels. It systematically compares three SES induction methods (UDPipe, IXA pipes, and Morpheus) across seven morphologically diverse languages, using both multilingual and language-specific masked language models as backbones. Results show that separating casing from edit operations and employing suffix-focused edits (UDPipe) yields the strongest lemmatization performance, particularly for highly inflected languages. Morpheus underperforms, and IXA pipes is competitive in certain settings. Multilingual pretrained models consistently outperform language-specific ones. The study also analyzes generalization to out-of-domain words and model contamination, providing practical guidance for SES design in future contextual lemmatizers and offering public resources for reproducibility.
Abstract
Modern contextual lemmatizers often rely on automatically induced Shortest Edit Scripts (SES), namely, the minimal sequence of edit operations needed to transform a word form into its lemma. In fact, different methods of computing SES have been proposed as an integral component in the architecture of several state-of-the-art contextual lemmatizers currently available. However, previous work has not investigated the direct impact of SES on the final lemmatization performance. In this paper we address this issue by treating lemmatization as a token classification task in which the only input the model receives is word-label pairs in context, where the labels correspond to previously induced SES. Thus, by modifying only the SES labels that the model needs to learn and keeping the rest of the lemmatization system fixed, we can objectively determine which SES representation produces the best lemmatization results. We experiment with seven languages of different morphological complexity, namely, English, Spanish, Basque, Russian, Czech, Turkish and Polish, using multilingual and language-specific pre-trained encoder-only masked language models as a backbone to build our lemmatizers. Comprehensive experimental results, both in- and out-of-domain, indicate that computing the casing and edit operations separately is beneficial overall, and much more clearly so for languages with highly inflected morphology. Notably, multilingual pre-trained language models consistently outperform their language-specific counterparts in every evaluation setting.
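
To make the idea of an SES label concrete, the following is a minimal sketch of a suffix-based edit script that stores the casing decision separately from the edit operation. It is an illustration only: the label format `casing|strip|suffix` and the function names `induce_label` and `apply_label` are assumptions made for this example, and they do not reproduce the actual label formats of UDPipe, IXA pipes, or Morpheus.

```python
import os


def casing_of(lemma: str) -> str:
    """Describe the lemma's casing with a small, closed set of tags (assumed scheme)."""
    if lemma == lemma.lower():
        return "lower"
    if lemma == lemma[:1].upper() + lemma[1:].lower():
        return "cap"
    if lemma == lemma.upper():
        return "upper"
    return "lower"  # mixed casing is not modelled in this sketch


def induce_label(form: str, lemma: str) -> str:
    """Build a hypothetical 'casing|strip|suffix' label from a (form, lemma) pair."""
    f, l = form.lower(), lemma.lower()
    # The edit is computed on lowercased strings, so casing never leaks into it.
    shared = len(os.path.commonprefix([f, l]))
    strip = len(f) - shared   # characters to delete from the end of the form
    suffix = l[shared:]       # characters to append afterwards
    return f"{casing_of(lemma)}|{strip}|{suffix}"


def apply_label(form: str, label: str) -> str:
    """Recover the lemma from a word form and a label produced by induce_label."""
    casing, strip, suffix = label.split("|", 2)
    f = form.lower()
    base = f[: len(f) - int(strip)] + suffix
    if casing == "cap":
        return base[:1].upper() + base[1:]
    if casing == "upper":
        return base.upper()
    return base


if __name__ == "__main__":
    for form, lemma in [("Cats", "cat"), ("cats", "cat"), ("went", "go"), ("Madrid", "Madrid")]:
        label = induce_label(form, lemma)
        print(form, lemma, label, apply_label(form, label))
```

In this sketch, "Cats" and "cats" receive the same label because the edit is computed on lowercased strings, which keeps the label set a token classifier has to learn smaller than it would be if casing and edits were fused into a single script.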
