Joint Lemmatization and Morphological Tagging with LEMMING

Thomas Müller, Ryan Cotterell, Alexander Fraser, Hinrich Schütze

TL;DR

LEMMING addresses lemmatization and fine-grained morphological tagging with a token-level, dictionary-free joint model that combines a log-linear lemmatizer with a higher-order CRF tagger. It uses edit-tree-driven candidate selection, a rich feature set, and a globally normalized objective to jointly predict lemmas and morpho-syntactic tags, with inference via belief propagation and training via SGD. The approach yields state-of-the-art token-based lemmatization on six languages and shows that tagging and lemmatization benefit each other when learned jointly, including notable reductions in Czech lemma and tag-lemma errors. The work demonstrates that joint modeling with arbitrary global lemma features can improve both lemma and tag accuracy, offering practical gains for morphologically rich languages without reliance on external dictionaries or analyzers.

Abstract

We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.

Paper Structure (15 sections, 2 equations, 3 figures, 6 tables)

This paper contains 15 sections, 2 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Edit tree for the inflected form umgeschaut "looked around" and its lemma umschauen "to look around". The right tree is the actual edit tree we use in our model, the left tree visualizes what each node corresponds to. The root node stores the length of the prefix umge (4) and the suffix t (1).
  • Figure 2: Our model is a 2nd-order linear-chain CRF augmented to predict lemmata. We heavily prune our model and can easily exploit higher-order (> 2) tag dependencies.
  • Figure 3: Edit tree for the inflected form umgeschaut "looked around" and its lemma umschauen "to look around". The right tree is the actual edit tree we use in our model, the left tree visualizes what each node corresponds to. Note how the root node stores the length of the prefix umge and the suffix t.
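
The edit tree described in the Figure 1 caption can be illustrated with a small sketch. It assumes the usual recursive construction: find the longest common substring of form and lemma, store the lengths of the form's prefix and suffix around it, and recurse on the two remaining pairs, falling back to a plain substitution when nothing is shared. The names `Sub`, `Node`, `build`, and `apply_tree` are hypothetical and chosen for illustration only; this is not LEMMING's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Sub:
    """Leaf: replace the string `src` by `dst`."""
    src: str
    dst: str


@dataclass(frozen=True)
class Node:
    """Inner node: lengths of the form's prefix and suffix around the shared middle."""
    prefix_len: int
    suffix_len: int
    left: Optional["Tree"]
    right: Optional["Tree"]


Tree = Union[Sub, Node]


def longest_common_substring(a: str, b: str):
    """Return (start_in_a, start_in_b, length) of the first longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def build(form: str, lemma: str) -> Optional[Tree]:
    """Build an edit tree that rewrites `form` into `lemma`."""
    if not form and not lemma:
        return None  # nothing to rewrite
    i, j, n = longest_common_substring(form, lemma)
    if n == 0:
        return Sub(form, lemma)  # no shared material: plain substitution
    return Node(
        prefix_len=i,
        suffix_len=len(form) - i - n,
        left=build(form[:i], lemma[:j]),
        right=build(form[i + n:], lemma[j + n:]),
    )


def apply_tree(tree: Optional[Tree], form: str) -> Optional[str]:
    """Apply `tree` to `form`; return None if the tree does not fit the form."""
    if tree is None:
        return "" if form == "" else None
    if isinstance(tree, Sub):
        return tree.dst if form == tree.src else None
    mid_end = len(form) - tree.suffix_len
    if tree.prefix_len > mid_end:
        return None  # form is too short for this tree
    left = apply_tree(tree.left, form[:tree.prefix_len])
    right = apply_tree(tree.right, form[mid_end:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len:mid_end] + right


tree = build("umgeschaut", "umschauen")
candidate = apply_tree(tree, "angeschaut")
```

Because the tree stores only lengths and small substitutions rather than the shared stem, a tree extracted from one form-lemma pair can be applied to other forms with the same inflection pattern. In this sketch, forms the tree does not fit yield `None`, which is one simple way to restrict the set of lemma candidates for a token.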