Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing

Eshaan Tanwar; Gayatri Oke; Tanmoy Chakraborty

TL;DR

This study probes how multilingual LLMs link orthography to meaning in bilingual word processing across cognates, non-cognates, and interlingual homographs for EN–ES, EN–FR, and EN–DE. By adapting psycholinguistic tasks into three prompts—word-pair disambiguation, semantic judgment, and semantically constrained sentence processing—the authors evaluate five open-source LLMs on 1,260 word-pair items. Results show strong cognate facilitation driven by orthography, but robust semantic retrieval remains elusive, with interlingual homographs often misinterpreted or disambiguated below chance; semantic judgments do not reliably predict disambiguation performance. These findings, interpreted through the BIA+ framework, highlight a gap in cross-lingual grounding and suggest the need for better cross-language alignment or grounding for reliable bilingual semantic processing in LLMs.

Abstract

Bilingual lexical processing is shaped by the complex interplay of phonological, orthographic, and semantic features of two languages within an integrated mental lexicon. In humans, this is evident in the ease with which cognates - words similar in both orthographic form and meaning (e.g., blind, meaning "sightless" in both English and German) - are processed, compared to the challenges posed by interlingual homographs, which share orthographic form but differ in meaning (e.g., gift, meaning "present" in English but "poison" in German). We investigate how multilingual Large Language Models (LLMs) handle such phenomena, focusing on English-Spanish, English-French, and English-German cognates, non-cognates, and interlingual homographs. Specifically, we evaluate their ability to disambiguate meanings and make semantic judgments, both when these word types are presented in isolation and when they appear within sentence contexts. Our findings reveal that while certain LLMs demonstrate strong performance in recognizing cognates and non-cognates in isolation, they exhibit significant difficulty in disambiguating interlingual homographs, often performing below random baselines. This suggests that LLMs tend to rely heavily on orthographic similarity rather than semantic understanding when interpreting interlingual homographs. Further, we find that LLMs have difficulty retrieving word meanings: performance on disambiguation tasks in isolation shows no correlation with semantic understanding. Finally, we study how LLMs process interlingual homographs in incongruent sentences. We find that models adopt different strategies for English and non-English homographs, highlighting the lack of a unified approach to handling cross-lingual ambiguities.

Paper Structure (9 sections, 3 equations, 8 figures, 5 tables)

Supplementary information
Significance study
Performance of shots and models in word disambiguation
Word pair disambiguation
Semantic judgment and word pair disambiguation correlation
Completely and partially identical cognates
Human Evaluation of LLM's response.
Correction type distribution
Homograph's distribution across correction type

Figures (8)

Figure 1: A working example to illustrate the three linguistic entities: cognates, non-cognates, and interlingual homographs, using words from Late Latin (LA), English (EN), Spanish (ES), Proto-Germanic (PG), Latin (LL), and Old German (OG). A. Cognates: The English word 'Soup' and the Spanish word 'Sopa' share the same meaning (a liquid dish) and similar orthography, derived from the Latin root 'Suppa'. B. Non-cognates: The English word 'Pot' and the Spanish word 'Olla' represent the same meaning (a clay vessel) but have different orthography, tracing their origins to distinct roots (Proto-Germanic: Puttaz and Late Latin: Olla). C. Interlingual Homographs: The English word 'Blank' and the Spanish word 'Blanco' differ in meaning (space/gap vs. goal/aim) but share the same orthographic form, derived from the Old German root 'Blankaz'. However, homographs need not share an etymological root; hence they are marked by a dotted line in the figure.
Figure 2: (A) Word pair disambiguation accuracy. The figure shows the average performance of five multilingual LLMs in identifying whether word pairs have the same meaning. The experiments are conducted with five random seed values, and the variation in performance across these runs is represented by error bars. All the models perform better on cognates than on non-cognate pairs, highlighting the role of cognate facilitation in LLMs. All models except BLOOMZ perform poorly in disambiguating homograph pairs compared to cognates, further showing their reliance on orthographic signals in the disambiguation task. (B) Class distribution of predictions for the word pair disambiguation task. For each of the five models, we plot the predicted label distribution against the true label distribution. We observe a strong bias in BLOOMZ's predictions: it predicts 'False' in most cases, even though the true label distribution has more 'True' cases. Other models show better balance in their predictions.
Figure 3: (A) Word language recognition. The top half of the figure represents the language of the word as identified by the LLM, while the bottom half shows the actual language of the word. The chords between the two halves show the relationship between the identified language and the actual language. In 73% of the cases, the model correctly identifies the language of the word. In the remaining cases, it assigns the language of the context to the word, interpreting the word's meaning based on surrounding linguistic cues. Notably, for German, there are exceptions where the model classifies certain words as Dutch or French, even though the context was in English. (B) Word meaning recognition. The top half of the figure represents the language of the word meaning used by the LLM, while the bottom half shows the actual language of the word. The chords between the two halves show the relationship between the meaning the LLM uses and the word's actual meaning. For a majority of Spanish, German, and French words, the model selected the English meaning of the interlingual homograph, which shows that the model's ability to retrieve meaning depends on the context of the sentence. However, for English words, we do not observe this ability to the same degree: English words are mostly assigned an English meaning without any consideration of context. (C) The interplay between word language recognition and word meaning. The top half of the figure represents the language of the word meaning used by the LLM, while the bottom half shows the language of the word as recognised by the LLM. The chords between the two halves show the relationship between the meaning the LLM uses and its language recognition. For a majority of Spanish, German, and French words, the model selected the English meaning of the interlingual homograph, which shows that the model's ability to retrieve meaning depends on the context of the sentence. However, for English words, we do not observe this ability to the same degree: English words are mostly assigned an English meaning without any consideration of context. (D) Model's correction ability. The figure illustrates the model's correction and sentence comprehension abilities. We observe that the majority of non-English sentences undergo Correction Type 2, where the model uses context cues to correct the homograph. In contrast, English sentences undergo no correction, as the model relies on the word's meaning to understand the sentence, even though the condition is congruent. Furthermore, the model labels the sentence as semantically meaningful in most cases, regardless of whether it has been corrected (En: English, Fr: French, Ge: German, Sp: Spanish, and Ni: Neither).
Figure 4: Word pair disambiguation accuracy. The figure shows the performance of five multilingual LLMs, across five seed values, in identifying whether word pairs have the same meaning, for three language pairs: (A) English-German, (B) English-Spanish, and (C) English-French. All the models perform better on cognates than on non-cognate pairs, highlighting the role of cognate facilitation in LLMs. All models except BLOOMZ perform poorly in disambiguating homograph pairs compared to cognates, further showing their reliance on orthographic signals in the disambiguation task.
Figure 5: Variation in performance across shots. (A) Cognates: Performance improves with the number of shots and stabilizes after two shots. LLaMA-3.1 consistently outperforms the other models, while LLaMA-2 shows more variation. (B) Non-cognates: The performance of models decreases with the number of shots; however, for LLaMA-2, the performance increases after shot two. (C) Interlingual Homographs: Both LLaMA-3 and LLaMA-3.1 show steady improvement with additional shots, whereas LLaMA-2 exhibits substantial variability but achieves its highest performance by four shots.
...and 3 more figures
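
The aggregation described in the Figure 2 caption (accuracy averaged over seeds, with a predicted-label distribution per model) might be computed roughly as in the sketch below. This is an illustrative sketch only, not the authors' evaluation code: the function names, the toy labels, and the choice of sample standard deviation for the error bars are all assumptions made here for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def accuracy(preds, golds):
    """Fraction of items whose predicted label equals the gold label."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def summarize_runs(runs, golds):
    """Mean accuracy over seeds plus a spread value for error bars.

    Assumption: the spread is the sample standard deviation; the source
    does not say which statistic its error bars show.
    """
    scores = [accuracy(preds, golds) for preds in runs.values()]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


def label_distribution(labels):
    """Share of each label, e.g. to compare predicted vs. true balance."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical toy data: True means "the two words share a meaning".
golds = [True, True, False, True]
runs = {
    0: [True, True, False, False],   # predictions for seed 0
    1: [True, False, False, False],  # predictions for seed 1
}

mean_acc, spread = summarize_runs(runs, golds)
pred_dist = label_distribution(runs[1])
true_dist = label_distribution(golds)
```

Comparing `pred_dist` with `true_dist` is one simple way to surface the kind of skew toward 'False' that the caption reports for BLOOMZ.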
