Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva, Sina Ahmadi
TL;DR
The paper investigates whether pretrained language models can identify loanwords across 10 languages using the ConLoan dataset, motivated by evidence that NLP systems are biased toward loanwords over native equivalents. It employs two methods: prompting large language models in zero- and few-shot settings, and fine-tuning multilingual encoders for BIO-based loanword span labeling. Across languages, LLMs struggle to distinguish loanwords from native vocabulary: average F1 scores stay below 0.7, and the leading LLM, Gemini, reaches only about 0.466. Fine-tuned encoders perform substantially better, with F1 between 0.648 and 0.851 (XLM-R large is the strongest at 0.8513). The study reveals persistent errors in distinguishing loanwords from code-switches, named entities, and Greco-Latin terms, highlighting the need for deeper contextual and sociolinguistic grounding to support language preservation and minority-language NLP efforts.
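To make "BIO-based loanword span labeling" concrete, the following is a minimal sketch of how token-level loanword spans could be encoded as BIO tags, decoded back, and scored. It is an illustration under stated assumptions, not the paper's implementation: the label names (B-LOAN, I-LOAN, O), the helper functions, the example sentence, and the choice of exact-match span-level F1 are all hypothetical, and the paper may compute F1 differently.

```python
from typing import List, Tuple

# A span is (start, end) in token indices, end exclusive.
Span = Tuple[int, int]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode loanword spans as BIO tags (label names are assumptions)."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-LOAN"          # first token of a loanword span
        for i in range(start + 1, end):
            tags[i] = "I-LOAN"          # continuation tokens
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into spans; a stray I- tag opens a new span."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-LOAN":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I-LOAN":
            if start is None:
                start = i
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def span_f1(gold: List[Span], pred: List[Span]) -> float:
    """Exact-match span-level F1 (one possible scoring scheme)."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Invented example (not from ConLoan): an English borrowing in Spanish.
tokens = ["Comimos", "fast", "food", "ayer"]
gold_spans = [(1, 3)]
gold_tags = spans_to_bio(len(tokens), gold_spans)
print(list(zip(tokens, gold_tags)))
```

In a setup like this, a fine-tuned encoder would predict one tag per token, and the predicted tags would be decoded with `bio_to_spans` before scoring against the gold spans.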
Abstract
Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient language's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native words. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
