Joint Lemmatization and Morphological Tagging with LEMMING
Thomas Müller, Ryan Cotterell, Alexander Fraser, Hinrich Schütze
TL;DR
LEMMING tackles lemmatization and fine-grained morphological tagging with a token-level, dictionary-free joint model that combines a log-linear lemmatizer with a higher-order CRF tagger. It uses a novel edit-tree–driven candidate selection, a rich feature set, and a globally normalized objective to jointly predict lemmas and morpho-syntactic tags; inference uses belief propagation and training uses SGD. The approach yields state-of-the-art token-based lemmatization across six languages and shows that tagging and lemmatization benefit substantially from being learned jointly, including notable reductions in Czech lemma and tag–lemma errors. The work demonstrates that joint modeling with arbitrary global lemma features can improve both lemma and tag accuracy, offering practical gains for morphologically rich languages without relying on external dictionaries or analyzers.
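To make the idea of a log-linear lemmatizer over a candidate set concrete, here is a minimal sketch. It is not the authors' implementation: the suffix-rewrite rules are a simplified, hypothetical stand-in for edit trees, and the feature templates and weights are invented purely for illustration. It only shows the general shape, which is to generate candidate lemmas for a form, score each with a weighted feature sum, and normalize with a softmax.

```python
import math

def candidate_lemmas(form, rules):
    """Generate candidates with hypothetical suffix-rewrite rules
    (a simplified stand-in for edit trees, not the paper's method)."""
    candidates = {form}  # the unchanged form is always a candidate
    for old, new in rules:
        if old and form.endswith(old):
            candidates.add(form[: len(form) - len(old)] + new)
    return sorted(candidates)

def features(form, tag, lemma):
    """Illustrative feature templates (assumed, not taken from the paper)."""
    return [
        f"lemma={lemma}",
        f"tag={tag}|identity={lemma == form}",
        f"tag={tag}|lemma_suffix={lemma[-2:]}",
    ]

def lemma_distribution(form, tag, rules, weights):
    """Log-linear model: p(lemma | form, tag) is proportional to exp(w . f)."""
    cands = candidate_lemmas(form, rules)
    scores = [sum(weights.get(f, 0.0) for f in features(form, tag, c))
              for c in cands]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {c: e / z for c, e in zip(cands, exps)}

# Hypothetical rules and hand-set weights; a real model would learn the weights.
rules = [("s", ""), ("ies", "y"), ("ing", "")]
weights = {"tag=NNS|identity=False": 1.0, "tag=NNS|lemma_suffix=ty": 2.0}

dist = lemma_distribution("cities", "NNS", rules, weights)
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Because the tag is part of every feature, the same form can receive different lemma distributions under different tags, which is one way to see why tagging and lemmatization can inform each other.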
Abstract
We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
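As a quick check of the Czech figure, the relative error reduction can be computed directly from the two error rates quoted in the abstract. The helper name below is hypothetical; the result is roughly 61%, consistent with the "60%" stated above once rounded.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error that the new system removes."""
    return (baseline_error - new_error) / baseline_error

# Error rates (in percent) quoted in the abstract for Czech lemmatization.
reduction = relative_error_reduction(4.05, 1.58)
print(f"{reduction:.1%}")
```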
