GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models

Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, Kai-Wei Chang

TL;DR

GeoMLAMA introduces a geo-diverse commonsense probing benchmark for multilingual PLMs: 3,125 prompts in five languages (English, Chinese, Hindi, Persian, Swahili) that query knowledge about five countries (the United States, China, India, Iran, Kenya). Using LAMA-style masked prompting, answer-candidate scoring, and prior calibration, the study finds that larger models do not consistently outperform smaller ones and that a country's native language is not always the best language for probing knowledge about it. The results reveal language- and country-dependent patterns, including intrinsic biases when country cues are removed and evidence that reporting bias shapes which knowledge is readily elicited. The work provides a benchmark, a documented methodology, and a data/code release to support future research on geo-diversity and multilingual commonsense.
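The answer-candidate scoring and prior calibration mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the template wording, the choice of `[MASK]` as the neutral placeholder for the country slot, and the probabilities in `TOY_PROBS` are all assumptions made for illustration. A real setup would replace `toy_log_prob` with a masked language model's log-probability for the candidate.

```python
import math
from typing import Callable, Dict, List

# Hypothetical scorer signature: log P(candidate fills the answer slot | prompt).
Scorer = Callable[[str, str], float]


def calibrated_scores(
    template: str,
    country: str,
    candidates: List[str],
    log_prob: Scorer,
    slot: str = "[X]",
    neutral: str = "[MASK]",
) -> Dict[str, float]:
    """Score each candidate with the country cue, minus its score without it."""
    prompt = template.replace(slot, country)        # country-specific prompt
    prior_prompt = template.replace(slot, neutral)  # country cue removed
    return {
        c: log_prob(prompt, c) - log_prob(prior_prompt, c)
        for c in candidates
    }


def predict(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)


# Made-up probabilities, for illustration only (not results from the paper).
# Key: whether the prompt contains the country cue.
TOY_PROBS = {
    True: {"white": 0.5, "red": 0.4},
    False: {"white": 0.7, "red": 0.2},
}


def toy_log_prob(prompt: str, candidate: str) -> float:
    """Stand-in for a masked language model; looks up a toy probability."""
    has_country = "China" in prompt
    return math.log(TOY_PROBS[has_country][candidate])


TEMPLATE = "In [X], the color of a bridal dress is usually [MASK]."
CANDIDATES = ["white", "red"]

# Uncalibrated choice: highest log-probability under the country prompt alone.
raw_choice = max(
    CANDIDATES,
    key=lambda c: toy_log_prob(TEMPLATE.replace("[X]", "China"), c),
)

scores = calibrated_scores(TEMPLATE, "China", CANDIDATES, toy_log_prob)
calibrated_choice = predict(scores)
print(raw_choice, calibrated_choice)
```

In this toy setting the uncalibrated scorer favors the candidate that is frequent regardless of country, while subtracting the prior rewards the candidate whose probability rises once the country is named.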

Abstract

Recent work has shown that Pre-trained Language Models (PLMs) store relational knowledge learned from data and use it to perform downstream tasks. However, commonsense knowledge may vary across regions. For instance, the color of a bridal dress is white in American weddings, whereas it is red in Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-Diverse Commonsense Multilingual Language Models Analysis (GeoMLAMA), for probing the diversity of relational knowledge in multilingual PLMs. GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with wide coverage of concepts shared by people from American, Chinese, Indian, Iranian, and Kenyan cultures. We benchmark 11 standard multilingual PLMs on GeoMLAMA. Interestingly, we find that 1) larger multilingual PLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) multilingual PLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge; and 4) a language may better probe knowledge about a non-native country than about its native country. Code and data are released at https://github.com/WadeYin9712/GeoMLAMA.

Paper Structure (33 sections, 4 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Examples of prompts and gold answers in GeoMLAMA. For each concept (e.g., color of wedding dress), there are multiple masked multilingual prompts (English, Hindi, Swahili, etc.) with specified country information [X] querying geo-diverse knowledge about the concept. We test multilingual PLMs by examining the extent to which masked word predictions align with the gold answers in the [MASK] columns.
  • Figure 2: Overall annotation pipeline. It is divided into four stages: Stage 1 is to collect geo-diverse concepts; Stage 2 is to design English prompt templates; Stage 3 is to annotate answers for each country and construct an answer candidate list; Stage 4 is to translate the English prompts and paraphrase the translated multilingual prompts. Here we showcase English and Hindi answer annotations for demonstration.
  • Figure 3: Multilingual PLMs' performance on probing knowledge about the studied countries, averaged over all languages. Complete results are shown in the appendix.
  • Figure 4: Multilingual PLMs' performance averaged over countries when using multilingual prompts. "en", "zh", "hi", "fa", and "sw" denote English, Chinese, Hindi, Persian, and Swahili. Complete results are shown in the appendix.
  • Figure 5: Average performance of multilingual PLMs when fed with prompts without any specified country names. Complete results are shown in the appendix.