Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings

Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, Iryna Gurevych

TL;DR

This work examines whether multilingual LLMs can serve as culturally diverse reasoners by evaluating their ability to memorize and reason with proverbs across six languages using the MAPS dataset. MAPS combines proverbs, conversational contexts, and interpretation labels to test reasoning under cultural common ground, distinguishing memorization from genuine contextual understanding and exploring cross-cultural gaps via translations. Findings show that while memorization improves as models scale up, reasoning with figurative proverbs and cross-cultural interpretations remains weak, with pronounced culture gaps in translations. The study releases MAPS to enable rigorous evaluation of cross-cultural reasoning in open-source mLLMs and highlights the need for culturally informed multilingual training and evaluation approaches for inclusive cross-language understanding.

Abstract

Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in a situational context, human expectations vary depending on the relevant cultural common ground. As languages are associated with diverse cultures, LLMs should also be culturally-diverse reasoners. In this paper, we study the ability of a wide range of state-of-the-art multilingual LLMs (mLLMs) to reason with proverbs and sayings in a conversational context. Our experiments reveal that: (1) mLLMs "know" only a limited number of proverbs, and memorizing proverbs does not mean understanding them within a conversational context; (2) mLLMs struggle to reason with figurative proverbs and sayings, and when asked to select the wrong answer (instead of the correct answer); and (3) there is a "culture gap" in mLLMs when reasoning about proverbs and sayings translated from other languages. We construct and release our evaluation dataset MAPS (MulticulturAl Proverbs and Sayings) for proverb understanding with conversational context for six different languages.
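The binary-choice setup and the "select the wrong answer" variant mentioned in finding (2) can be illustrated with a minimal scoring sketch. This is not the authors' evaluation code: the `Item` fields, the `gold_index` helper, and the `accuracy` function are hypothetical names chosen for illustration, and the sketch assumes each item has exactly two candidate interpretations, one of which is correct.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """One hypothetical binary-choice item (field names are assumptions)."""
    proverb: str
    context: str
    choices: Tuple[str, str]  # two candidate interpretations
    correct: int              # index (0 or 1) of the correct interpretation


def gold_index(item: Item, negative: bool = False) -> int:
    """Index the model should pick.

    For the usual question the gold choice is the correct interpretation;
    for the 'negative' question (select the wrong answer) it is the other one.
    """
    return 1 - item.correct if negative else item.correct


def accuracy(items: List[Item], predictions: List[int], negative: bool = False) -> float:
    """Fraction of items where the predicted index matches the gold index."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot compute accuracy on an empty list")
    hits = sum(pred == gold_index(item, negative)
               for item, pred in zip(items, predictions))
    return hits / len(items)


# Toy usage with placeholder strings (not taken from MAPS).
items = [
    Item("proverb A", "context A", ("reading 1", "reading 2"), correct=0),
    Item("proverb B", "context B", ("reading 1", "reading 2"), correct=1),
]
predictions = [0, 1]  # indices chosen by some model
positive_acc = accuracy(items, predictions)
negative_acc = accuracy(items, predictions, negative=True)
```

Scoring the same predictions under both question types makes the asymmetry concrete: a model that always picks the correct interpretation scores perfectly on the usual question but fails the negative one unless it actually flips its choice.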

Paper Structure (46 sections, 15 figures, 15 tables)

Figures (15)

  • Figure 1: Proverbs are fixed expressions used by different cultures. We collect proverbs from six languages (top) and their usage within conversational contexts. We evaluate mLLMs with a binary-choice inference task in the conversational context that contains proverbs (bottom).
  • Figure 2: Visualizing proverb embeddings using kernel density estimation (KDE).
  • Figure 3: Performance of mLLMs on the proposed MAPS dataset. The number of parameters is in billions for LLaMA-2 and in millions for all other models.
  • Figure 4: Performance of mLLMs on the proposed MAPS - Inference task when asking the 'negative' question. The number of parameters is in billions for LLaMA-2 and in millions for all other models.
  • Figure 5: Performance gaps between machine-translated data, human-translated data, and results in the original source language (Zh) and the target language (En).
  • ...and 10 more figures