Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

Aleksei Dorkin; Kairit Sirts

TL;DR

This paper compares three approaches to Estonian lemmatization (a Generative encoder-decoder model, a Pattern-based transformation model, and the rule-based Vabamorf) on two UD-derived corpora, EDT and EWT, to assess their performance and complementarity. The Generative model consistently yields the highest token-based accuracy on both EDT and EWT; the Pattern-based and rule-based systems score lower but make different errors. The overlap in errors among the three approaches is relatively small (e.g., 5.8%), suggesting that an ensemble could exploit their different strengths. The study also demonstrates the impact of data preprocessing (casing, symbol removal) and of cross-domain generalization, with practical implications for processing morphologically rich languages.

Abstract

This study evaluates three different approaches to lemmatization for Estonian: Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
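To make the two quantities mentioned above concrete, the following is a minimal sketch of how token-based accuracy and error overlap could be computed. The abstract does not say how overlap is normalized, so this sketch assumes it is the share of all tokens that every system lemmatizes incorrectly. The function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence, Set


def token_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tokens whose predicted lemma equals the gold lemma."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_indices(gold: Sequence[str], pred: Sequence[str]) -> Set[int]:
    """Positions at which the predicted lemma differs from the gold lemma."""
    return {i for i, (g, p) in enumerate(zip(gold, pred)) if g != p}


def shared_error_rate(gold: Sequence[str], *system_preds: Sequence[str]) -> float:
    """Share of tokens that ALL given systems get wrong.

    Assumption: overlap is normalized by the total token count; other
    normalizations (e.g., by the union of errors) are equally plausible.
    """
    if not gold or not system_preds:
        return 0.0
    error_sets = [error_indices(gold, pred) for pred in system_preds]
    shared = set.intersection(*error_sets)
    return len(shared) / len(gold)


# Toy, made-up predictions for five tokens (not data from the study).
gold = ["kass", "minema", "olema", "maja", "laps"]
generative = ["kass", "minema", "olema", "maja", "lapse"]
pattern_based = ["kass", "mine", "olema", "maja", "lapse"]
rule_based = ["kass", "minema", "ole", "maja", "lapse"]

for name, pred in [
    ("Generative", generative),
    ("Pattern-based", pattern_based),
    ("Rule-based", rule_based),
]:
    print(name, token_accuracy(gold, pred))

print("shared errors:", shared_error_rate(gold, generative, pattern_based, rule_based))
```

Under this definition, a small shared error rate means that for most tokens at least one system produces the correct lemma, which is what motivates an ensemble.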
