Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

Aleksei Dorkin, Kairit Sirts

TL;DR

This paper compares three paradigms for Estonian lemmatization (a generative encoder–decoder, pattern-based transformation, and the rule-based Vabamorf) on two UD-derived corpora, EDT and EWT, to assess performance and complementarity. The Generative model consistently yields the highest token-based accuracy on both corpora, while the Pattern-based and rule-based systems perform worse but make different errors. The overlap in errors among the three approaches is relatively small (e.g., 5.8%), suggesting that an ensemble could exploit their different strengths. The study also examines the effect of data preprocessing (casing, symbol removal) and of cross-domain generalization, with practical implications for processing morphologically rich languages.

Abstract

This study evaluates three approaches to Estonian lemmatization: Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. In our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in the errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
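The two quantities the abstract relies on, token-level accuracy and the overlap in errors between systems, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names, the toy tokens, and the predictions are hypothetical, and the choice of denominator for the overlap (tokens that at least one system gets wrong) is an assumption, since the summary does not say how the reported percentage is normalized.

```python
from typing import Dict, List, Set


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Share of tokens whose predicted lemma matches the gold lemma exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned token by token")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_indices(gold: List[str], pred: List[str]) -> Set[int]:
    """Positions of the tokens that a system lemmatizes incorrectly."""
    return {i for i, (g, p) in enumerate(zip(gold, pred)) if g != p}


def shared_error_share(gold: List[str], systems: Dict[str, List[str]]) -> float:
    """Errors made by every system, as a share of errors made by any system.

    Assumption: the denominator is the union of all error sets.
    """
    error_sets = [error_indices(gold, pred) for pred in systems.values()]
    union = set().union(*error_sets)
    shared = set.intersection(*error_sets)
    return len(shared) / len(union) if union else 0.0


# Toy data, invented for illustration only (not taken from EDT or EWT).
gold = ["maja", "olema", "laps", "minema", "raamat"]
systems = {
    "generative": ["maja", "olema", "laps", "minema", "raama"],
    "pattern": ["maja", "olema", "lapse", "minema", "raama"],
    "rule_based": ["maja", "oli", "laps", "minema", "raama"],
}

accuracy = {name: token_accuracy(gold, pred) for name, pred in systems.items()}
overlap = shared_error_share(gold, systems)
```

In this toy setup only the last token is missed by all three systems, while two other tokens are each missed by a single system. A small shared share of this kind is what makes an ensemble plausible: where the systems disagree, at least one of them is often right.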

Paper Structure (10 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Schematic representations of the generative and pattern-based approaches.
  • Figure 2: Venn diagram of the token-level lemmatization errors made by each model on the EDT validation set.