Comparison of Current Approaches to Lemmatization: A Case Study in Estonian
Aleksei Dorkin, Kairit Sirts
TL;DR
This paper compares three Estonian lemmatization paradigms (a Generative encoder–decoder model, a Pattern-based transformation classifier, and the rule-based Vabamorf analyzer) on two UD-derived corpora, EDT and EWT, to assess their performance and complementarity. The Generative model consistently yields the highest token-based accuracy on both corpora. The Pattern-based and rule-based systems score lower, but their errors differ from those of the Generative model. The overlap in errors among the three approaches is relatively small (e.g., 5.8%), suggesting that ensemble methods could exploit their different strengths. The study also examines the effect of data preprocessing (casing, symbol removal) and of cross-domain generalization, both of which matter in practice for processing morphologically rich languages.
Abstract
This study evaluates three approaches to Estonian lemmatization: Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. In our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in the errors made by the three models, indicating that an ensemble of the different approaches could lead to improvements.
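To make the two quantities mentioned above concrete, the following is a minimal sketch of how token-based accuracy and the error overlap among several systems could be computed. It is an illustration only, not the evaluation code used in the study. The function names, the toy predictions, and in particular the definition of overlap (tokens that every system gets wrong, as a share of tokens that at least one system gets wrong) are assumptions; the study may define the overlap differently.

```python
from typing import Sequence, Set


def token_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tokens whose predicted lemma equals the gold lemma."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_indices(gold: Sequence[str], pred: Sequence[str]) -> Set[int]:
    """Positions of tokens where the prediction differs from gold."""
    return {i for i, (g, p) in enumerate(zip(gold, pred)) if g != p}


def shared_error_rate(gold: Sequence[str], *system_preds: Sequence[str]) -> float:
    """Assumed overlap definition: errors made by ALL systems, divided by
    errors made by AT LEAST ONE system."""
    errors = [error_indices(gold, pred) for pred in system_preds]
    if not errors:
        return 0.0
    any_wrong = set.union(*errors)
    all_wrong = set.intersection(*errors)
    return len(all_wrong) / len(any_wrong) if any_wrong else 0.0


# Hypothetical toy data: gold lemmas and three made-up system outputs.
gold = ["laps", "mängima", "aed", "suur", "koer"]
generative = ["laps", "mängima", "aed", "suur", "koera"]
pattern_based = ["laps", "mängima", "aia", "suur", "koera"]
rule_based = ["lapse", "mängima", "aed", "suur", "koera"]

acc_generative = token_accuracy(gold, generative)
overlap = shared_error_rate(gold, generative, pattern_based, rule_based)
```

A low overlap under this definition means that most errors are made by only one or two of the systems. In that case a correct lemma is often available from at least one of them, which is the situation an ensemble can exploit.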
