Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Changbing Yang, Garrett Nicolai

TL;DR

The paper tackles low-resource canonical morpheme segmentation by predicting both segmentation and gloss from orthography using a transformer-based multitask model, augmented with LLM-generated synthetic data. It formalizes a joint objective $\mathcal{L}_{total} = \lambda \mathcal{L}_{seg} + (1 - \lambda) \mathcal{L}_{gloss}$ and leverages in-context prompts with GPT-4o to broaden morphological coverage. Evaluations on the SIGMORPHON 2023 dataset across multiple languages show that multitask learning improves word-level accuracy and morpheme-level F1, with additional gains when synthetic data are included, particularly in data-scarce languages. These results advance linguistic documentation efforts by enabling more robust automated IGT annotation while reducing reliance on extensive expert supervision.
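As a minimal sketch of the joint objective above, the weighted sum can be written as a small helper that combines two per-task losses. The function name `joint_loss`, the default `lam=0.5`, and the numeric loss values are illustrative assumptions, not taken from the paper; in practice the two inputs would be the segmentation and gloss losses computed by the model.

```python
def joint_loss(seg_loss: float, gloss_loss: float, lam: float = 0.5) -> float:
    """Combine two task losses as lam * L_seg + (1 - lam) * L_gloss.

    `lam` plays the role of lambda in the objective; the default of 0.5
    is an assumption for illustration, not a value reported by the paper.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * seg_loss + (1.0 - lam) * gloss_loss


# Hypothetical per-batch losses for the two output heads.
total = joint_loss(seg_loss=2.0, gloss_loss=4.0, lam=0.25)
print(total)
```

Setting `lam` to 1 recovers a segmentation-only objective, and lowering it gives the gloss prediction task more influence on the shared parameters.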

Abstract

We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.
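The two reported metrics can be illustrated with a short sketch. It assumes that word-level accuracy is exact match on the full segmented word and that morpheme-level F1 is micro-averaged over the multiset overlap of predicted and gold morphemes; the official shared-task scorer may define these differently. The function names, the `-` separator, and the English example words are hypothetical.

```python
from collections import Counter


def word_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of words whose predicted segmentation matches gold exactly."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def morpheme_f1(preds: list[str], golds: list[str], sep: str = "-") -> float:
    """Micro-averaged F1 over morphemes, counted as multisets per word."""
    true_pos = n_pred = n_gold = 0
    for pred, gold in zip(preds, golds):
        pred_counts = Counter(pred.split(sep))
        gold_counts = Counter(gold.split(sep))
        # Multiset intersection counts morphemes present in both analyses.
        true_pos += sum((pred_counts & gold_counts).values())
        n_pred += sum(pred_counts.values())
        n_gold += sum(gold_counts.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions and gold segmentations.
preds = ["walk-ed", "un-happy"]
golds = ["walk-ed", "un-happi-ness"]
print(word_accuracy(preds, golds), morpheme_f1(preds, golds))
```

The second word shows why both metrics are useful: it counts as fully wrong for word-level accuracy, yet still earns partial credit at the morpheme level for the correctly predicted prefix.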

Paper Structure

This paper contains 17 sections, 1 equation, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: The average learning curves for the F1 (top) and Accuracy (bottom) metrics.
  • Figure 2: The learning curves for the F1 (top) and Accuracy (bottom) metrics among all languages.