Multilingual Prompting for Improving LLM Generation Diversity

Qihan Wang, Shidong Pan, Tal Linzen, Emily Black

TL;DR

This work identifies the lack of cultural and demographic diversity in LLM generations and proposes multilingual prompting as a principled method to activate culture-specific knowledge across languages. By creating multiple prompt variants in different languages with cultural cues and aggregating their responses, the method yields higher diversity than prior approaches while preserving factual accuracy. Language alignment also reduces culture-specific hallucinations, and the diversity gains scale with the number of languages and vary by model size and resource level. The results suggest multilingual prompting as a practical, scalable technique to elicit broader perspectives from LLMs for diverse and more representative outputs.

Abstract

Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and persona prompting. Further analyses show that the benefits of multilingual prompting vary between high and low resource languages and across model sizes, and that aligning the prompting language with cultural cues reduces hallucination about culturally-specific information.
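The three steps named in the abstract (build language-specific variants of a base prompt, generate a response for each, combine the results) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `multilingual_prompting`, `fake_translate`, and `fake_generate` are hypothetical, the translation and model calls are replaced by toy stand-ins, and "combining" is assumed here to mean simply pooling the responses into one list.

```python
from typing import Callable, List


def multilingual_prompting(
    base_prompt: str,
    languages: List[str],
    translate: Callable[[str, str], str],
    generate: Callable[[str], str],
) -> List[str]:
    """Pool one response per language-specific variant of base_prompt.

    translate(prompt, language) and generate(prompt) are caller-supplied;
    how the paper builds its variants and combines outputs may differ.
    """
    pooled = []
    for language in languages:
        variant = translate(base_prompt, language)  # step 1: prompt variant
        pooled.append(generate(variant))            # step 2: model response
    return pooled                                   # step 3: combined pool


# Toy stand-ins (assumptions) so the sketch runs without any model or API.
def fake_translate(prompt: str, language: str) -> str:
    return f"[{language}] {prompt}"


def fake_generate(prompt: str) -> str:
    return f"response to: {prompt}"


responses = multilingual_prompting(
    "Which singers should I follow?",
    ["English", "Chinese"],
    fake_translate,
    fake_generate,
)
print(responses)
```

In practice the two callables would wrap a real translation step and a real model call; keeping them as parameters makes the pooling logic independent of any particular provider.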

Paper Structure

This paper contains 48 sections, 5 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: An example of the diversity of an LLM's (GPT-4o) responses when prompted in English versus in multiple languages: on the left, we show demographic diversity, specifically the range of different nationalities represented in an answer about which singers to follow; on the right, we show the level of agreement with a controversial social norms statement. We measure diversity by calculating the (normalized) entropy of model responses, explained in more detail in the paper's section on metrics. Multilingual prompting leads to an increase in diversity.
  • Figure 2: Above: an overview of multilingual and multicultural prompting, and our diversity evaluation. Below: example prompts from our multilingual and multicultural methods, and a subset of methods we compare to.
  • Figure 3: Diversity comparison for GPT-4o and GPT-4o-mini across multilingual methods.
  • Figure 4: Error rates of Chinese names generated under two prompting strategies. Using multilingual prompts in Chinese yields a lower error rate compared to multicultural prompts (cultural cues but without including the relevant language) in English, demonstrating that prompting in the relevant language reduces hallucination.
  • Figure 5: Prompts for social norm questions
  • ...and 5 more figures
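
The Figure 1 caption measures diversity as the (normalized) entropy of model responses. The exact definition is given in the paper's metrics section and is not reproduced on this page, so the sketch below rests on an assumption: Shannon entropy of the label distribution divided by the log of the number of possible categories, which puts the score in [0, 1]. The function name `normalized_entropy`, the `num_categories` parameter, and the nationality labels are all hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def normalized_entropy(labels: Sequence[Hashable], num_categories: int) -> float:
    """Shannon entropy of the label distribution, scaled to [0, 1].

    Assumption: normalization divides by log(num_categories), the entropy
    of a uniform distribution over all possible categories.
    """
    if not labels or num_categories < 2:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # max() guards against a negative zero when every label is identical.
    return max(0.0, entropy / math.log(num_categories))


# Toy data (hypothetical): nationality labels extracted from four responses.
same = ["US", "US", "US", "US"]
mixed = ["US", "CN", "JP", "BR"]

low = normalized_entropy(same, num_categories=4)
high = normalized_entropy(mixed, num_categories=4)
print(low, high)
```

Under this assumed definition, a response set that always names the same nationality scores at the bottom of the range, and one spread evenly over all categories scores at the top.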