Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection
Aso Mahmudi, Borja Herce, Demian Inostroza Amestica, Andreas Scherbakov, Eduard Hovy, Ekaterina Vylomova
TL;DR
The paper tackles the challenge of efficiently documenting endangered languages by introducing a Word-and-Paradigm–based neural framework that models linguist–speaker interactions during morphological data elicitation. It contrasts several active-learning sampling strategies and evaluates state-of-the-art inflection models across typologically diverse languages. Uniform sampling across paradigm cells generally yields stronger generalisation in low-resource settings, while incorporating the model’s confidence into annotation decisions improves interaction efficiency. The study provides a structured evaluation of data-efficiency metrics, including a Normalised Efficiency Score, and demonstrates that confidence-guided predictions can reduce elicitation cost without sacrificing accuracy. These insights offer a practical blueprint for ergonomic, data-efficient fieldwork and have the potential to accelerate language documentation in resource-constrained settings.
Abstract
Linguistic fieldwork is an important component of language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during fieldwork and accounts for the dynamics of linguist–speaker interactions. We introduce a framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving efficiency: (1) increasing the diversity of annotated data by sampling uniformly among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.
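
The two strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `(lemma, cell)` representation of an unlabelled item, the `predict` and `ask_speaker` callables, and the 0.9 confidence threshold are all hypothetical choices made for illustration. The first function spreads an annotation budget evenly over paradigm cells; the second shows one possible reading of a confidence-guided step, in which a sufficiently confident prediction is offered to the speaker instead of asking them to produce the form from scratch.

```python
import random
from collections import defaultdict


def sample_uniform_over_cells(pool, k, rng):
    """Pick up to k (lemma, cell) items, visiting cells round-robin so that
    every paradigm cell is represented as evenly as the pool allows."""
    by_cell = defaultdict(list)
    for lemma, cell in pool:
        by_cell[cell].append((lemma, cell))
    for items in by_cell.values():
        rng.shuffle(items)  # random lemma order within each cell

    cells = sorted(by_cell)
    picked = []
    while len(picked) < k and any(by_cell[c] for c in cells):
        for c in cells:
            if by_cell[c] and len(picked) < k:
                picked.append(by_cell[c].pop())
    return picked


def elicit(item, predict, ask_speaker, threshold=0.9):
    """Hypothetical confidence-gated step: offer the model's form when it is
    confident enough, otherwise ask the speaker to supply the form."""
    form, confidence = predict(item)
    if confidence >= threshold:
        return form, "suggested"
    return ask_speaker(item), "elicited"
```

With a pool skewed towards one cell (say ten lemmas for `PST` and two each for `PRS;3;SG` and `FUT`), a budget of six yields two items per cell, whereas sampling uniformly over items would mostly return `PST` entries.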
