Labeled Morphological Segmentation with Semi-Markov Models

Ryan Cotterell; Thomas Müller; Alexander Fraser; Hinrich Schütze

TL;DR

This work introduces labeled morphological segmentation (LMS), a framework that attaches fine-grained morphotactic labels to morph segments and unifies three NLP tasks: segmentation, stemming, and morphological tag classification. It proposes Chipmunk, a semi-Markov CRF that explicitly models morphotactics and leverages a rich feature set (affix gazetteers, stem validity via spell-checkers, and cross-product features with morphotactic tags) across six languages. The approach yields consistent improvements over state-of-the-art baselines in all tasks, with notable gains in agglutinative languages and when using intermediate tagset granularities. By unifying analysis and guessing under a single probabilistic model, LMS enables robust handling of novel roots and affixes and provides a practical pathway for integrating nuanced morphological analysis into downstream NLP systems.

Abstract

We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop Chipmunk, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that Chipmunk yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2-6 points $F_1$ over the baseline.
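
To make the semi-Markov idea concrete, the following is a minimal sketch of Viterbi decoding over labeled segments rather than over single characters. It is not Chipmunk's implementation: the function names, the three-label tagset, the toy gazetteer, and the hand-set scores are all assumptions made for illustration, and no CRF training is shown. The only point is that each scoring call sees a whole candidate segment, its label, and the previous label, which is where affix-gazetteer and morphotactic features could enter.

```python
from typing import Callable, Dict, List, Tuple

START = "<s>"  # hypothetical start-of-word label

def semi_markov_viterbi(
    word: str,
    labels: List[str],
    score: Callable[[str, str, str], float],
    max_len: int = 6,
) -> List[Tuple[str, str]]:
    """Return the highest-scoring labeled segmentation of `word`.

    score(segment, label, previous_label) is any real-valued function;
    in a trained semi-Markov CRF it would be a learned feature score.
    """
    n = len(word)
    # best[i][y]: best score of a segmentation of word[:i] ending in label y
    best: List[Dict[str, float]] = [dict() for _ in range(n + 1)]
    back: List[Dict[str, Tuple[int, str]]] = [dict() for _ in range(n + 1)]
    best[0][START] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            seg = word[j:i]
            for prev, prev_score in best[j].items():
                for y in labels:
                    cand = prev_score + score(seg, y, prev)
                    if y not in best[i] or cand > best[i][y]:
                        best[i][y] = cand
                        back[i][y] = (j, prev)
    # Follow back-pointers from the best final label to the start.
    y = max(best[n], key=best[n].get)
    i, segments = n, []
    while i > 0:
        j, prev = back[i][y]
        segments.append((word[j:i], y))
        i, y = j, prev
    return segments[::-1]

# Toy gazetteer and morphotactic constraints (illustrative assumptions).
LABELS = ["PREFIX", "ROOT", "SUFFIX"]
GAZETTEER = {"PREFIX": {"un"}, "ROOT": {"lock"}, "SUFFIX": {"ed"}}
BAD_TRANSITIONS = {(START, "SUFFIX"), ("ROOT", "PREFIX"), ("SUFFIX", "PREFIX")}

def toy_score(seg: str, label: str, prev: str) -> float:
    # Reward segments found in the gazetteer for this label, penalize others.
    s = 2.0 if seg in GAZETTEER[label] else -1.0 * len(seg)
    # Penalize label sequences that violate the toy morphotactics.
    if (prev, label) in BAD_TRANSITIONS:
        s -= 5.0
    return s

result = semi_markov_viterbi("unlocked", LABELS, toy_score)
```

With these toy scores, `result` is the segmentation un / lock / ed labeled PREFIX, ROOT, SUFFIX. The `max_len` bound keeps decoding at O(n · max_len · |labels|²) scoring calls.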

Paper Structure (25 sections, 2 equations, 3 figures, 10 tables)

Introduction
Paper Outline.
Labeled Segmentation and Tagset
Model
Features
Affix Features and Gazetteers.
Stem Features.
Integrating the Features.
Related Work
Memory-based Learning.
Unsupervised UMS.
Supervised UMS.
Chinese Word Segmentation.
Experiments
UMS Experiments
...and 10 more sections

Figures (3)

Figure 1: Examples of the tasks addressed for the Turkish word gençleşmelerin ('of the rejuvenatings'): Traditional unlabeled segmentation (UMS), Labeled morphological segmentation (LMS), stemming / root detection and (inflectional) morphological tag classification. The morphotactic annotations produced by LMS allow us to solve these tasks using a single model.
Figure 2: Example of the different morphotactic tagset granularities for German Enteisungen 'defrostings'.
Figure 3: Comparative analysis of undersegmentation. Each column (labels at the bottom) shows how often CRF-Morph +LSV (top number in heatmap) and Chipmunk (bottom number in heatmap) select a segment that is two separate segments in the gold standard. E.g., Rt-Sx indicates how often a root and a suffix were treated as a single segment. The color depends on the difference between the two counts.
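
The claim in the Figure 1 caption, that one labeled segmentation suffices for several tasks, can be sketched as simple projections of the LMS output. The label strings below (`ROOT`, `SUFFIX:DERIV`, `SUFFIX:INFL:...`) are hypothetical stand-ins rather than the paper's tagset, the helper names are invented for this example, and the sketch assumes that the stem is the word with its inflectional suffixes removed.

```python
from typing import List, Tuple

LabeledSegmentation = List[Tuple[str, str]]  # (morph, morphotactic label)

# Illustrative labeling of the Turkish word from Figure 1.
# The label names are assumptions, not the paper's tagset.
example: LabeledSegmentation = [
    ("genç", "ROOT"),
    ("leş", "SUFFIX:DERIV"),
    ("me", "SUFFIX:DERIV"),
    ("ler", "SUFFIX:INFL:PLURAL"),
    ("in", "SUFFIX:INFL:GENITIVE"),
]

def unlabeled_segmentation(lms: LabeledSegmentation) -> List[str]:
    # UMS: keep the morphs, drop the labels.
    return [morph for morph, _ in lms]

def root(lms: LabeledSegmentation) -> str:
    # Root detection: concatenate segments labeled as roots.
    return "".join(m for m, label in lms if label.startswith("ROOT"))

def stem(lms: LabeledSegmentation) -> str:
    # Stemming (assumed definition): strip inflectional affixes only.
    return "".join(m for m, label in lms if ":INFL" not in label)

def inflectional_tag(lms: LabeledSegmentation) -> str:
    # Tag classification: collect the features carried by inflectional labels.
    feats = [label.split(":INFL:")[1] for _, label in lms if ":INFL:" in label]
    return ":".join(feats)
```

For the example, `root(example)` gives `genç`, `stem(example)` gives `gençleşme`, and `inflectional_tag(example)` gives `PLURAL:GENITIVE`.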
