Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages

Jonathan Sakunkoo, Annabella Sakunkoo

TL;DR

The paper investigates the reliability of crowd-sourced morphological knowledge about defectivity in Latin and Italian by deploying a scalable pipeline that trains a neural morphological analyzer on UD treebanks, annotates massive CC-100 corpora, and validates Wiktionary defectivity lists using Indirect Negative Evidence. A log-odds divergence metric $L_w=\log\left(\dfrac{p_w}{p_l p_f}\right)$ with a threshold of $>1.9$ is used to quantify non-defectivity and detect inconsistencies. The results show Italian defectivity entries align with corpus usage at roughly $80\%$, while Latin shows more discrepancies, with about $6$–$7\%$ of Wiktionary-listed defective lemmata likely non-defective. This work demonstrates a scalable approach to quality assurance for crowd-sourced linguistic data and highlights both the value and limitations of Wiktionary for non-English morphologies, with practical implications for improving morphologically aware NLP systems.
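As a rough illustration of the metric above, the following Python sketch computes $L_w$ from raw counts. The TL;DR does not define the symbols, so the code assumes that $p_w$ is the relative frequency of a specific (lemma, feature bundle) pair, $p_l$ that of the lemma, and $p_f$ that of the feature bundle, and that the logarithm is natural. The function names, the feature labels, and the toy counts are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

THRESHOLD = 1.9  # threshold quoted in the TL;DR


def log_odds(counts: Counter, lemma: str, feats: str) -> float:
    """L_w = log(p_w / (p_l * p_f)), estimated from (lemma, feats) counts.

    Assumption: p_w, p_l and p_f are relative frequencies over all
    annotated tokens; the natural logarithm is also an assumption.
    """
    total = sum(counts.values())
    c_w = counts[(lemma, feats)]
    if c_w == 0:
        return float("-inf")  # the form is never attested
    c_l = sum(c for (l, _), c in counts.items() if l == lemma)
    c_f = sum(c for (_, f), c in counts.items() if f == feats)
    p_w, p_l, p_f = c_w / total, c_l / total, c_f / total
    return math.log(p_w / (p_l * p_f))


def likely_non_defective(counts: Counter, lemma: str, feats: str,
                         threshold: float = THRESHOLD) -> bool:
    """Flag a form whose score exceeds the threshold."""
    return log_odds(counts, lemma, feats) > threshold


# Toy counts, invented purely for illustration.
counts = Counter({
    ("lemma_a", "FEATS_X"): 10,
    ("lemma_b", "FEATS_Y"): 90,
})

print(likely_non_defective(counts, "lemma_a", "FEATS_X"))
print(likely_non_defective(counts, "lemma_b", "FEATS_Y"))
```

In this toy setting the first pair scores $\log 10 \approx 2.30$ and is flagged, while the second scores close to zero and is not. An unattested pair is given a score of negative infinity so that it can never cross the threshold; this is a design choice of the sketch, not something stated in the source.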

Abstract

Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, because such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary are often among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.

Paper Structure

This paper contains 11 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Workflow for computational validation of morphological gaps, using UDTube