Labeled Morphological Segmentation with Semi-Markov Models

Ryan Cotterell, Thomas Müller, Alexander Fraser, Hinrich Schütze

TL;DR

This work introduces labeled morphological segmentation (LMS), a framework that attaches fine-grained morphotactic labels to morph segments and unifies three NLP tasks: segmentation, stemming, and morphological tag classification. It proposes Chipmunk, a semi-Markov CRF that explicitly models morphotactics and leverages a rich feature set (affix gazetteers, stem validity via spell-checkers, and cross-product features with morphotactic tags) across six languages. The approach yields consistent improvements over state-of-the-art baselines in all tasks, with notable gains in agglutinative languages and when using intermediate tagset granularities. By unifying analysis and guessing under a single probabilistic model, LMS enables robust handling of novel roots and affixes and provides a practical pathway for integrating nuanced morphological analysis into downstream NLP systems.

Abstract

We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop Chipmunk, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that Chipmunk yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2-6 points $F_1$ over the baseline.
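
To make the "semi-Markov" part concrete: a semi-Markov model scores whole segments (here, candidate morphs with a tag) rather than single characters, and decoding searches over all segmentations and tag sequences at once. The following is a minimal sketch of that decoding step only, not the Chipmunk implementation. The `toy_score` function, its tiny gazetteer, the tag names, and the English example word are all hypothetical stand-ins for a trained model's log-potentials.

```python
from typing import Callable, Dict, List, Optional, Tuple

START = "<s>"  # hypothetical start-of-word tag

def semi_markov_viterbi(
    word: str,
    tags: List[str],
    score: Callable[[str, str, str], float],
    max_len: int = 6,
) -> List[Tuple[str, str]]:
    """Return the highest-scoring labeled segmentation of `word`.

    score(prev_tag, tag, segment) stands in for the model's log-potential
    of emitting `segment` with label `tag` right after a `prev_tag` segment.
    """
    n = len(word)
    # best[i][t] = (score, backpointer) for the best analysis of word[:i]
    # whose last segment carries tag t
    best: List[Dict[str, Tuple[float, Optional[Tuple[int, str]]]]] = [
        {} for _ in range(n + 1)
    ]
    best[0][START] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = word[start:end]
            for prev_tag, (prev_score, _) in best[start].items():
                for tag in tags:
                    s = prev_score + score(prev_tag, tag, segment)
                    if tag not in best[end] or s > best[end][tag][0]:
                        best[end][tag] = (s, (start, prev_tag))
    # Follow backpointers from the best final tag.
    tag = max(best[n], key=lambda t: best[n][t][0])
    pos, out = n, []
    while pos > 0:
        start, prev_tag = best[pos][tag][1]
        out.append((word[start:pos], tag))
        pos, tag = start, prev_tag
    return out[::-1]

def toy_score(prev_tag: str, tag: str, segment: str) -> float:
    """Hypothetical scorer: a gazetteer feature plus a morphotactic penalty."""
    gazetteer = {("un", "PREFIX"), ("do", "ROOT"), ("able", "SUFFIX")}
    s = 2.0 if (segment, tag) in gazetteer else -1.0 * len(segment)
    # Morphotactics: prefixes may not follow roots or suffixes,
    # and suffixes may not come before any root.
    if tag == "PREFIX" and prev_tag not in (START, "PREFIX"):
        s -= 5.0
    if tag == "SUFFIX" and prev_tag in (START, "PREFIX"):
        s -= 5.0
    return s

result = semi_markov_viterbi("undoable", ["PREFIX", "ROOT", "SUFFIX"], toy_score)
print(result)
```

Because the score depends on the previous segment's tag, ordering constraints between morph types enter the search directly, which is the sense in which such a model can represent morphotactics.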

Paper Structure

This paper contains 25 sections, 2 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Examples of the tasks addressed for the Turkish word gençleşmelerin ('of the rejuvenatings'): Traditional unlabeled segmentation (UMS), Labeled morphological segmentation (LMS), stemming / root detection and (inflectional) morphological tag classification. The morphotactic annotations produced by LMS allow us to solve these tasks using a single model.
  • Figure 2: Example of the different morphotactic tagset granularities for German Enteisungen 'defrostings'.
  • Figure 3: Comparative analysis of undersegmentation. Each column (labels at the bottom) shows how often CRF-Morph +LSV (top number in heatmap) and Chipmunk (bottom number in heatmap) select a single segment that corresponds to two separate segments in the gold standard. For example, Rt-Sx counts how often a root and a following suffix were treated as a single segment. The color depends on the difference between the two counts.
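
The Figure 1 caption notes that the LMS annotations allow all of the listed tasks to be solved with a single model. The sketch below illustrates why: once each morph carries a morphotactic label, the other outputs are simple projections of the labeled segmentation. The tag strings and helper names are illustrative assumptions, not the paper's exact tagset or code, and the analysis of gençleşmelerin is written by hand rather than copied from the figure.

```python
from typing import List, Tuple

LabeledSegment = Tuple[str, str]  # (morph, colon-separated morphotactic tag)

# Hand-written, illustrative labeled segmentation (assumed tag names).
analysis: List[LabeledSegment] = [
    ("genç", "ROOT"),
    ("leş", "SUFFIX:DERIV:VERB"),
    ("me", "SUFFIX:DERIV:NOUN"),
    ("ler", "SUFFIX:INFL:NOUN:PLURAL"),
    ("in", "SUFFIX:INFL:NOUN:GENITIVE"),
]

def unlabeled_segmentation(a: List[LabeledSegment]) -> List[str]:
    """Traditional unlabeled segmentation: drop the labels."""
    return [morph for morph, _ in a]

def roots(a: List[LabeledSegment]) -> List[str]:
    """Root detection: keep only morphs labeled as roots."""
    return [morph for morph, tag in a if tag.split(":")[0] == "ROOT"]

def stem(a: List[LabeledSegment]) -> str:
    """Stemming: strip inflectional affixes, keep roots and derivation."""
    return "".join(morph for morph, tag in a if "INFL" not in tag.split(":"))

def inflectional_tags(a: List[LabeledSegment]) -> List[str]:
    """Morphological tag classification: read off the inflectional features."""
    return [tag.split(":")[-1] for _, tag in a if "INFL" in tag.split(":")]

print(unlabeled_segmentation(analysis))
print(roots(analysis), stem(analysis), inflectional_tags(analysis))
```

A coarser tagset (for example, only ROOT versus SUFFIX) would still support segmentation and root detection but not the stem or tag projections, which is one way to read the role of the tagset granularities shown in Figure 2.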