Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance

Manon Reusens, Philipp Borchert, Jochen De Weerdt, Bart Baesens

TL;DR

The paper investigates how the nativeness of English prompts affects large language model performance across three user groups: Western native (WN), non-Western native (NWN), and non-native (NN) English speakers. It introduces a newly collected dataset with 12,519 annotations from 124 annotators across ten instruction-based tasks, with translations into eight languages, and evaluates multiple chat-based LLMs. The findings show that native prompts, particularly from Western natives, yield higher accuracy on objective classification tasks but higher misalignment on subjective tasks, while generative tasks are comparatively robust to nativeness; an anchoring effect emerges when models are told a prompt writer’s nativeness, biasing outputs toward the indicated group. Results are model-dependent and underscore the importance of dataset diversity and prompt design in mitigating bias. The work contributes a large multilingual dataset, a systematic evaluation framework, and insights into designing LLMs that perform equitably across diverse English varieties.

Abstract

Large Language Models (LLMs) excel at providing information acquired during pretraining on large-scale corpora and following instructions through user prompts. This study investigates whether the quality of LLM responses varies depending on the demographic profile of users. Considering English as the global lingua franca, along with the diversity of its dialects among speakers of different native languages, we explore whether non-native English speakers receive lower-quality or even factually incorrect responses from LLMs more frequently. Our results show that performance discrepancies occur when LLMs are prompted by native versus non-native English speakers and persist when comparing native speakers from Western countries with others. Additionally, we find a strong anchoring effect when the model recognizes or is made aware of the user's nativeness, which further degrades the response quality when interacting with non-native speakers. Our analysis is based on a newly collected dataset with over 12,000 unique annotations from 124 annotators, including information on their native language and English proficiency.
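
The anchoring effect is measured by comparing responses to the same annotation when it is submitted anonymously versus with an explicit statement about the writer's nativeness. The snippet below is a minimal sketch of such a comparison; the prompt wording, the `query_llm` helper, and the chat-message format are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of an anchoring-condition comparison (illustrative only).
# `query_llm` is a hypothetical helper wrapping whichever chat model is queried;
# the prompt wording here is an assumption, not the paper's exact template.

NATIVENESS_GROUPS = ["Western native", "non-Western native", "non-native"]

def build_messages(task_instruction, annotator_prompt, group=None):
    """Build chat messages for one annotation, with or without nativeness info."""
    system = "You are a helpful assistant."
    if group is not None:
        # Anchoring condition: the model is explicitly told who wrote the prompt.
        system += f" The following prompt was written by a {group} English speaker."
    user = f"{task_instruction}\n\n{annotator_prompt}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def run_conditions(task_instruction, annotator_prompt, query_llm):
    """Collect one baseline response plus one response per disclosed group."""
    outputs = {"baseline": query_llm(build_messages(task_instruction, annotator_prompt))}
    for group in NATIVENESS_GROUPS:
        outputs[group] = query_llm(build_messages(task_instruction, annotator_prompt, group))
    return outputs
```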

Paper Structure

This paper contains 37 sections, 2 equations, 19 figures, and 13 tables.

Figures (19)

  • Figure 1: Two example prompts from a native and a non-native English speaker and the corresponding output given by GPT4o, where Annotator Prompt is the placeholder for the annotations. For the objective task, the model selects the wrong answer for the non-native English speaker, even though the prompt conveys semantically the same message. While Sentence B from the non-native speaker ("People can't be eaten.") may seem different from Sentence A from the native speaker, it is a direct translation from the non-native speaker’s first language and conveys the same meaning from the non-native prompt writer’s perspective. This demonstrates how slight variations in phrasing, common among non-native speakers, can lead to misinterpretations or different model responses despite semantic equivalence. For the subjective task, the model estimates the native speaker’s answer to be more positive than actually intended.
  • Figure 2: Methodology and experimental setup. The left part shows the data collection steps: after gathering the different datasets, study participants annotated the examples; we then validated the annotations and used them as input to generate LLM responses. The right part shows the evaluation phase, where we gathered the respective scores depending on the task.
  • Figure 3: WN is best-performing for objective classification tasks and worst-performing for the subjective classification tasks. The figure shows average model performance per group and task type averaged for all models and runs; y-axis is adjusted to 0.65–1 for clarity.
  • Figure 4: The generative tasks are more robust against native bias. This figure shows the average model performance for the generative tasks per group and metric averaged over the different runs. We rescaled the results so that they range from 0 to 1 (a minimal sketch of this per-group averaging and rescaling follows the figure list).
  • Figure 5: This figure shows the average performance for the different classification tasks per model and group. We see how both GPT models clearly prefer the Western native group, while the other models show similar preference for both native groups for the objective classification task. For the subjective classification tasks, the Western native group is the worst-performing group for all models. We adjusted the y-axis to range from 0.65 to 1 for clarity.
  • ...and 14 more figures
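
As a rough illustration of the aggregation behind Figures 3 and 4, the sketch below averages scores per group and task type, and min-max rescales the generative metrics to the 0–1 range before averaging. The dataframe layout and column names are assumptions for illustration, not the paper's analysis code.

```python
import pandas as pd

# Hypothetical per-response results table; file name and columns are assumed.
# Each row: one model response scored for one annotation.
# Columns: model, run, group (WN/NWN/NN), task_type, metric, score
results = pd.read_csv("scores.csv")

# Figure 3-style view: mean performance per group and task type,
# averaged over all models and runs (classification tasks use accuracy).
classification = results[results["task_type"].isin(["objective", "subjective"])]
fig3_view = classification.groupby(["group", "task_type"])["score"].mean().unstack()

# Figure 4-style view: generative metrics are min-max rescaled to [0, 1]
# per metric before averaging, so differently scaled metrics are comparable.
generative = results[results["task_type"] == "generative"].copy()
generative["rescaled"] = generative.groupby("metric")["score"].transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)
fig4_view = generative.groupby(["group", "metric"])["rescaled"].mean().unstack()
```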