Labeled Morphological Segmentation with Semi-Markov Models
Ryan Cotterell, Thomas Müller, Alexander Fraser, Hinrich Schütze
TL;DR
This work introduces labeled morphological segmentation (LMS), a framework that attaches fine-grained morphotactic labels to morph segments and unifies three NLP tasks: segmentation, stemming, and morphological tag classification. It proposes Chipmunk, a semi-Markov CRF that explicitly models morphotactics and leverages a rich feature set (affix gazetteers, stem validity via spell-checkers, and cross-product features with morphotactic tags) across six languages. The approach yields consistent improvements over state-of-the-art baselines in all tasks, with notable gains in agglutinative languages and when using intermediate tagset granularities. By unifying analysis and guessing under a single probabilistic model, LMS enables robust handling of novel roots and affixes and provides a practical pathway for integrating nuanced morphological analysis into downstream NLP systems.
Abstract
We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop \textsc{chipmunk}, a discriminative morphological segmentation system that, unlike previous work, explicitly models morphotactics. We show that \textsc{chipmunk} yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2--6 $F_1$ points over the baseline.
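To make the idea of labeled segmentation with a semi-Markov model more concrete, the following is a minimal sketch of semi-Markov Viterbi decoding over labeled segments. It is not the Chipmunk implementation: the three-label tagset, the toy gazetteers, the hand-set segment scores, and the hard transition constraints standing in for morphotactics are all assumptions made for illustration, whereas a trained semi-Markov CRF would learn feature weights from data.

```python
# Hypothetical toy gazetteers; a real system would use learned feature weights.
GAZETTEER = {
    "PREFIX": {"un", "re"},
    "ROOT": {"happi", "happy", "do"},
    "SUFFIX": {"ness", "ing"},
}

# Assumed morphotactic constraints: which label may follow which.
ALLOWED = {
    "START": {"PREFIX", "ROOT"},
    "PREFIX": {"PREFIX", "ROOT"},
    "ROOT": {"ROOT", "SUFFIX"},
    "SUFFIX": {"SUFFIX"},
}
FINAL = {"ROOT", "SUFFIX"}  # labels allowed on the last segment


def segment_score(segment, label):
    """Toy score: reward gazetteer hits, penalize unknown segments by length."""
    hit = segment in GAZETTEER[label]
    return float(len(segment)) if hit else -float(len(segment))


def decode(word, max_len=6):
    """Return the best-scoring list of (segment, label) pairs for `word`."""
    n = len(word)
    # best[i][label] = (score, backpointer) for the best labeled segmentation
    # of word[:i] whose last segment carries `label`.
    best = [dict() for _ in range(n + 1)]
    best[0]["START"] = (0.0, None)
    for i in range(1, n + 1):
        # Semi-Markov step: consider every segment word[j:i] up to max_len.
        for j in range(max(0, i - max_len), i):
            for prev, (prev_score, _) in best[j].items():
                for label in ALLOWED[prev]:
                    score = prev_score + segment_score(word[j:i], label)
                    if label not in best[i] or score > best[i][label][0]:
                        best[i][label] = (score, (j, prev))
    finals = [lab for lab in best[n] if lab in FINAL]
    if not finals:
        raise ValueError("no valid labeled segmentation found")
    label = max(finals, key=lambda lab: best[n][lab][0])
    # Follow backpointers from the end of the word to the start.
    segments, i = [], n
    while label != "START":
        _, (j, prev) = best[i][label]
        segments.append((word[j:i], label))
        i, label = j, prev
    return segments[::-1]


print(decode("unhappiness"))
```

Under these assumptions, the segment labeled `ROOT` could be read off as a stem and the label sequence as a coarse morphological tag, which is one way to picture how a single labeled segmentation can serve the three tasks mentioned in the abstract.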
