Crowdsourcing Lexical Diversity

Hadi Khalilia, Jahna Otterbacher, Gabor Bella, Shandy Darma, Fausto Giunchiglia

TL;DR

Lexical-semantic resources often reflect English-centric bias, missing culture-specific terms and cross-lingual lexical gaps. The paper introduces a pivot-free, bidirectional crowdsourcing methodology implemented in the LingoGap platform to identify lexical gaps and equivalents across language pairs in targeted semantic domains, demonstrated on English–Arabic and Indonesian–Banjarese food terminology. Native speakers outperformed current LLMs in capturing culturally nuanced meanings, validating the approach's effectiveness for low-resource languages and diverse dialects. The results show meaningful lexical-gap production and overlap metrics, suggesting a scalable path to more inclusive multilingual lexical resources with implications for MT, WSD, and cross-lingual NLP.

Abstract

Lexical-semantic resources (LSRs), such as online lexicons and wordnets, are fundamental to natural language processing applications as well as to fields such as linguistic anthropology and language preservation. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, and the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, in the presence of foreign (Anglo-Saxon) concepts, and in the lack of an explicit indication of untranslatability, also known as cross-lingual lexical gaps, where a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing platform facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
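
To make the bidirectional comparison concrete, the following is a minimal sketch of how worker judgments might be grouped into equivalents and lexical gaps per direction. It is not the LingoGap data model: the `Judgment` record, the `summarize` function, and the placeholder terms are all hypothetical, and a gap is modeled simply as a source term for which no target equivalent was reported.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Judgment:
    """One hypothetical worker judgment (assumed structure, not from the paper)."""
    source_lang: str
    source_term: str
    target_lang: str
    target_term: Optional[str]  # None: the worker found no equivalent (a gap)


def summarize(judgments):
    """Group judgments by direction into equivalent pairs and lexical gaps."""
    equivalents = defaultdict(list)
    gaps = defaultdict(list)
    for j in judgments:
        direction = (j.source_lang, j.target_lang)
        if j.target_term is None:
            gaps[direction].append(j.source_term)
        else:
            equivalents[direction].append((j.source_term, j.target_term))
    return dict(equivalents), dict(gaps)


# Placeholder terms only; these are not real lexemes from the case studies.
judgments = [
    Judgment("en", "en_term_1", "ar", "ar_term_1"),
    Judgment("en", "en_term_2", "ar", None),
    Judgment("ar", "ar_term_1", "en", "en_term_1"),
    Judgment("ar", "ar_term_2", "en", None),
]
equivalents, gaps = summarize(judgments)
```

Keeping the direction as part of the key reflects that the comparison runs both ways: a term can be a gap from the first language into the second without the reverse holding.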

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Crowd Filtering using Alpha.
  • Figure 2: Crowdsourced Data Validation using Alpha.
  • Figure 3: Worker’s GUI showing the step of match selection using LingoGap.
  • Figure 4: Flowchart of the semantic-field filtering method to collect food words from a digital dictionary.
  • Figure 5: The overlap (percentage of shared lexicalizations) for English and Arabic languages.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Definition 1: Lexical Gap