Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection

Aso Mahmudi, Borja Herce, Demian Inostroza Amestica, Andreas Scherbakov, Eduard Hovy, Ekaterina Vylomova

TL;DR

The paper tackles the challenge of efficiently documenting endangered languages by introducing a Word-and-Paradigm–based neural framework that models linguist–speaker interactions during morphological data elicitation. It compares several active-learning sampling strategies and evaluates state-of-the-art inflection models across typologically diverse languages. Uniform sampling across paradigm cells generally yields stronger generalisation in low-resource settings, while incorporating the model’s confidence into annotation decisions improves interaction efficiency. The study evaluates data efficiency with structured metrics, including a Normalised Efficiency Score, and shows that confidence-guided predictions can reduce elicitation cost without sacrificing accuracy. These findings offer a practical blueprint for ergonomic, data-efficient fieldwork and have the potential to accelerate language documentation in resource-constrained settings.

Abstract

Linguistic fieldwork is an important component of language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses how well state-of-the-art neural models generalise morphological structures. Our experiments highlight two key strategies for improving efficiency: (1) increasing the diversity of annotated data by sampling uniformly among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.
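
The first strategy, uniform sampling among paradigm cells, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_uniform_over_cells`, the representation of the pool as `(lemma, cell)` pairs, and the round-robin scheme are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def sample_uniform_over_cells(pool, k, seed=0):
    """Pick up to k (lemma, cell) items, spreading picks evenly over cells.

    `pool` is a hypothetical list of (lemma, cell) pairs that have not yet
    been elicited, e.g. ("walk", "PST").
    """
    rng = random.Random(seed)

    # Group the unlabelled items by paradigm cell and shuffle within each cell.
    by_cell = defaultdict(list)
    for lemma, cell in pool:
        by_cell[cell].append((lemma, cell))
    for items in by_cell.values():
        rng.shuffle(items)

    # Round-robin over cells: every pass takes one item from each cell that
    # still has items, so no single cell dominates the selection.
    cells = sorted(by_cell)
    picked = []
    while len(picked) < k and any(by_cell[c] for c in cells):
        for c in cells:
            if by_cell[c] and len(picked) < k:
                picked.append(by_cell[c].pop())
    return picked
```

Compared with drawing `k` items at random from the whole pool, this keeps the selection balanced across cells even when some cells are far more frequent in the pool than others.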

Paper Structure (20 sections, 1 equation, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of the proposed word elicitation process model.
  • Figure 2: A heatmap showing the accuracy of predictions for English verbs.
  • Figure 3: A simplified overview of sampling strategies used in the second cycle of the experiments. Blue cells represent samples retrieved without any predictions or confidence checks. Dark green cells denote confident ones retrieved with predictions, while dark red cells indicate low confidence cells with no predictions sent to the oracle. Orange cells indicate those that were selected in the first cycle and removed from the pool.
  • Figure 4: Accuracy on remaining pool data in each cycle of the active learning process for each language.
  • Figure 5: Predictions submitted alongside the requests to the oracle in each experiment. Exp. 1 is omitted because all of its requests were made without a prediction.
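
The confidence-guided requests described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration, not the paper's code: `build_request`, `run_cycle`, the `predict` and `oracle` callables, and the threshold of 0.8 are all hypothetical names and values.

```python
def build_request(item, predict, threshold=0.8):
    """Build an oracle request, attaching the model's guess only if confident.

    `predict` is a hypothetical callable returning (form, confidence).
    The 0.8 threshold is an arbitrary illustrative value.
    """
    form, confidence = predict(item)
    if confidence >= threshold:
        # Confident cell: the speaker confirms or corrects the prediction.
        return {"item": item, "prediction": form}
    # Low-confidence cell: no prediction is sent; the speaker supplies the form.
    return {"item": item, "prediction": None}


def run_cycle(pool, selected, predict, oracle, threshold=0.8):
    """Query the oracle for each selected item and remove it from the pool.

    Returns the newly labelled items and the number of submitted predictions
    that the oracle's answer matched exactly.
    """
    labelled, confirmed = [], 0
    for item in selected:
        request = build_request(item, predict, threshold)
        gold = oracle(request)  # hypothetical oracle: returns the correct form
        if request["prediction"] == gold:
            confirmed += 1
        labelled.append((item, gold))
        pool.remove(item)  # selected items leave the pool for later cycles
    return labelled, confirmed
```

Counting confirmed predictions separately from total requests gives a simple way to track how often the interaction was "positive", i.e. the speaker only had to agree with the model.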