GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text

Michael Ginn, Lindia Tjuatja, Taiqi He, Enora Rice, Graham Neubig, Alexis Palmer, Lori Levin

TL;DR

This work compiles the largest existing corpus of interlinear glossed text (IGT) from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. It also explores automatic IGT generation as an aid to language documentation projects.

Abstract

Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting their applicability to linguistic research, and making it difficult to use such data in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. Our pretrained model and dataset are available on Hugging Face.

Paper Structure (46 sections, 8 figures, 12 tables)

This paper contains 46 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Components of interlinear gloss with an Arapaho sentence and English translation (Cowell, 2020). Blue boxes show transcriptions that are unsegmented (top) or segmented (bottom). Segmented text is split into morphemes, which are aligned with the gloss labels shown in the green box. The task of automatic glossing uses some or all of the information in the gray box (transcription and translation) to generate the gloss line.
  • Figure 2: Distribution of unique glosses across all languages.
  • Figure 3: Comparison of our pretrained model and the SOTA (Girrbach, 2023) for in-domain languages on unsegmented data. Our model outperforms the SOTA on all three languages.
  • Figure 4: Morpheme accuracy for various systems.
  • Figure 5: Performance after monolingual finetuning, comparing a standard pretrained ByT5 with a continually pretrained GlossLM model. The x-axis uses the log (base 10) of the number of training examples in a given language, for readability.
  • ...and 3 more figures
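
The alignment described in the Figure 1 caption, where each morpheme of the segmented transcription is paired with one gloss label, can be sketched in Python as follows. This is an illustrative sketch only: the `IGTExample` class, the `-` morpheme delimiter, the English toy sentence, and the position-wise `morpheme_accuracy` function are all assumptions made for the example, not the paper's data format or its evaluation code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IGTExample:
    """Hypothetical container for one interlinear glossed text example."""

    transcription: str  # unsegmented surface text
    segmented: str      # morphemes joined by "-", words separated by spaces (assumed convention)
    glosses: str        # one gloss label per morpheme, same delimiters
    translation: str

    def aligned(self) -> list[list[tuple[str, str]]]:
        """Pair each morpheme with its gloss label, word by word."""
        seg_words = self.segmented.split()
        gloss_words = self.glosses.split()
        if len(seg_words) != len(gloss_words):
            raise ValueError("word counts differ between segmentation and gloss line")
        pairs = []
        for seg, gloss in zip(seg_words, gloss_words):
            morphemes, labels = seg.split("-"), gloss.split("-")
            if len(morphemes) != len(labels):
                raise ValueError(f"morpheme/gloss mismatch in {seg!r} vs {gloss!r}")
            pairs.append(list(zip(morphemes, labels)))
        return pairs


def morpheme_accuracy(predicted: str, gold: str) -> float:
    """Fraction of gold gloss labels matched position-wise (an assumed definition)."""
    correct = total = 0
    pred_words = predicted.split()
    for i, gold_word in enumerate(gold.split()):
        gold_labels = gold_word.split("-")
        pred_labels = pred_words[i].split("-") if i < len(pred_words) else []
        for j, label in enumerate(gold_labels):
            total += 1
            if j < len(pred_labels) and pred_labels[j] == label:
                correct += 1
    return correct / total if total else 0.0


# Toy English example, invented for illustration; it is not drawn from the corpus.
example = IGTExample(
    transcription="dogs barked",
    segmented="dog-s bark-ed",
    glosses="dog-PL bark-PST",
    translation="The dogs barked.",
)
print(example.aligned())
print(morpheme_accuracy("dog-PL bark-PRS", example.glosses))
```

In this sketch a predicted gloss line is scored label by label against the gold line, so a single wrong tense label in a four-label sentence lowers the score without zeroing the whole word.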