Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense

Samuel Cahyawijaya, Ruochen Zhang, Holy Lovenia, Jan Christian Blaise Cruz, Elisa Gilbert, Hiroki Nomoto, Alham Fikri Aji

TL;DR

This study addresses the gap in cross-lingual semantic evaluation by introducing StingrayBench, a novel benchmark for cross-lingual sense disambiguation. It uses false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitations of cross-lingual sense disambiguation in LLMs.

Abstract

Multilingual large language models (LLMs) have gained prominence, but concerns remain about their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate the use of false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitations of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German, and challenge LLMs to distinguish their use in context. In our analysis of various models, we observe that they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.

Paper Structure

This paper contains 65 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Our work explores two linguistic phenomena, false friends and true cognates, and highlights the limitations of LLMs in understanding cognates, which indicates a pitfall in cross-lingual disambiguation.
  • Figure 2: Annotation and data formulation pipeline of StingrayBench. Our annotation consists of a 3-step process that requires two annotators, one for each language of the language pair. In addition, we provide the English translation of the correct sentence for better accessibility to StingrayBench.
  • Figure 3: The Stingray plot is a 2D scatter plot where the X-axis and Y-axis represent the model performance on StingrayBench for each language of a pair. The cognate bias score towards a particular language is measured based on the angular distance of the data point (e.g. the model is unbiased if it performs equally well in both languages). The cognate comprehension score is measured based on the point's magnitude.
  • Figure 4: Stingray plot showing the performance of each LLM averaged across all language pairs and tasks. The trends differ between the (left) true cognate and (right) false friend subsets: LLMs show strong capability on true cognates but perform close to random guessing on false friends. This highlights the inability of existing LLMs to disambiguate false friends across different languages.
  • Figure 5: Most LLMs understand true cognates, but have limited understanding of false friends in the language pairs under study. We report the cognate comprehension scores averaged across the semantic correctness and usage correctness tasks.
  • ...and 9 more figures
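
The Figure 3 caption describes the two Stingray metrics only geometrically: bias from the angular position of a point, comprehension from its magnitude. The following is a minimal sketch of one way such scores could be computed, not the paper's actual formulas. The function name `stingray_scores`, the normalization to [-1, 1] and [0, 1], and the sign convention (positive means leaning toward the Y-axis language) are all assumptions made for illustration.

```python
import math


def stingray_scores(acc_lang_x: float, acc_lang_y: float) -> tuple[float, float]:
    """Illustrative (hypothetical) bias and comprehension scores.

    acc_lang_x, acc_lang_y: model accuracy in [0, 1] on each language of a
    pair, i.e. the X and Y coordinates of one point on a Stingray plot.
    """
    if acc_lang_x == 0.0 and acc_lang_y == 0.0:
        # The angle is undefined at the origin; treat it as unbiased (assumption).
        bias = 0.0
    else:
        # Angle of the point measured from the X-axis, in [0, pi/2].
        angle = math.atan2(acc_lang_y, acc_lang_x)
        # Deviation from the 45-degree diagonal, scaled to [-1, 1] (assumption).
        # 0 means equal performance on both languages.
        bias = (angle - math.pi / 4) / (math.pi / 4)

    # Distance from the origin, scaled so that (1, 1) maps to 1.0 (assumption).
    comprehension = math.hypot(acc_lang_x, acc_lang_y) / math.sqrt(2)
    return bias, comprehension


if __name__ == "__main__":
    # Made-up accuracies, not results from the paper.
    bias, comprehension = stingray_scores(0.9, 0.6)
    print(f"bias={bias:+.3f} comprehension={comprehension:.3f}")
```

Under these assumptions, a model that scores equally in both languages lies on the diagonal and gets a bias of 0, while a model that answers correctly in only one language sits on an axis and gets a bias of -1 or +1.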