Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages

Mérilin Sousa Silva, Sina Ahmadi

TL;DR

The paper investigates whether pretrained language models can identify loanwords across 10 languages using the ConLoan dataset, motivated by prior evidence that NLP systems favor loanwords over native equivalents. It uses two methods: prompting large language models in zero- and few-shot settings, and fine-tuning multilingual encoders for BIO-based loanword span labeling. Across languages, LLMs struggle to distinguish loanwords from native vocabulary: average F1 scores stay below 0.7, with Gemini leading at about 0.466. Fine-tuned encoders perform substantially better (F1 of 0.648–0.851, with XLM-R large reaching 0.8513). The study also reports persistent errors in separating loanwords from code-switches, named entities, and Greco-Latin terms, and argues that deeper contextual and sociolinguistic grounding is needed to support language preservation and minority-language NLP efforts.
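To make the BIO-based span labeling and F1 scoring mentioned above concrete, here is a minimal sketch of how token-level B/I/O tags can be decoded into spans and scored with exact-match F1. The tag names, the helper functions `bio_to_spans` and `span_f1`, and the example sentence are illustrative assumptions, not the paper's code or data, and the paper's own evaluation protocol may differ.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode a B/I/O tag sequence into a set of token spans.

    Assumption: plain "B", "I", "O" tags. A stray "I" with no open span
    is treated leniently as the start of a new span.
    """
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.add((start, i))  # close the previous span
            start = i
        elif tag == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))  # span running to the end
    return spans


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match span F1: a predicted span counts only if both
    boundaries match a gold span."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two gold loanword spans, one predicted too short.
tokens = ["She", "ordered", "a", "croissant", "and", "a", "café", "au", "lait"]
gold_tags = ["O", "O", "O", "B", "O", "O", "B", "I", "I"]
pred_tags = ["O", "O", "O", "B", "O", "O", "B", "I", "O"]

gold_spans = bio_to_spans(gold_tags)
pred_spans = bio_to_spans(pred_tags)
print(span_f1(gold_spans, pred_spans))  # 0.5
```

Under exact matching, the truncated prediction for the multi-token span earns no credit, which is one reason a more lenient matching scheme can report higher scores than a strict one.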

Abstract

Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.

Paper Structure

This paper contains 18 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: F1-scores of LLMs based on prompt, evaluation protocol, and fine-tuning setup. Overall, providing few shots with relaxed evaluation using Gemini yields higher F1-score for loanword identification.