Unintended Impacts of LLM Alignment on Global Representation

Michael J. Ryan, William Held, Diyi Yang

TL;DR

This study investigates the unintended consequences of aligning LLMs with user preferences for global representation along three axes: English dialects, multilingualism, and opinions from and about countries. Tracking two-stage alignment, supervised fine-tuning (SFT) followed by preference tuning (PT), across nine open LLMs, it finds that alignment often improves task performance while widening disparities between English dialects, improves performance in several languages, and biases models toward US opinions. The work also probes reward model signals and shows that out-of-distribution (OOD) country opinions are shaped largely by pretraining data rather than by reward signals, underscoring the need for transparency and careful data design in preference tuning. The findings inform practical guidance for more equitable alignment and contribute to the broader governance discussion of global accessibility and fairness in AI systems.

Abstract

Before deploying Large Language Models (LLMs) in user-facing applications, developers align them to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on GitHub.

Paper Structure

This paper contains 37 sections, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Country rewards for Starling 7B Reward Model prompted with "User: Where are you from? Assistant: I am from {country}." Starling assigns higher rewards to English-speaking Western nations and lower rewards to countries in the Middle East/Africa.
  • Figure 2: The process of aligning Base LMs into Chatbot assistants consists of two stages: supervised fine-tuning and preference tuning. We investigate how each stage impacts various global populations differently by exploring three axes of global representation: Dialect, Language, and Opinions.
  • Figure 3: MD3 Dialect Intent Prediction results before and after alignment with 95% confidence intervals. For Mistral-based models, alignment improves performance in all dialects but significantly more in US English.
  • Figure 4: Effects of Alignment on Multilingual Reading Comprehension and Question Answering. An asterisk (*) indicates a significant difference from the base LM (p < 0.05). Although the SFT datasets for each model focus almost exclusively on English, when SFT is beneficial for English, it often improves performance for other languages as well, especially for Tülu and Starling.
  • Figure 5: GlobalOpinionsQA difference in relative alignment to various countries' values before and after preference tuning. All alignment procedures seem to increase relative bias towards US opinions compared to Jordan, China, and Nigeria while remaining neutral for Western regions like Brazil, Germany, and Australia.
  • ...and 6 more figures
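
The probe described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the template string is quoted from the caption, while `build_prompts`, `rank_by_reward`, and `toy_reward` are hypothetical names introduced here. In particular, `toy_reward` is a placeholder scorer that stands in for a real reward model such as the Starling 7B Reward Model, and its scores carry no meaning.

```python
from typing import Callable, Dict, List, Tuple

# Prompt template quoted from the Figure 1 caption.
TEMPLATE = "User: Where are you from? Assistant: I am from {country}."


def build_prompts(countries: List[str]) -> Dict[str, str]:
    """Fill the template once per country name."""
    return {c: TEMPLATE.format(country=c) for c in countries}


def rank_by_reward(
    countries: List[str], reward_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score each filled prompt and sort countries from highest to lowest reward.

    reward_fn is an assumption: any callable mapping a prompt string to a
    scalar score, for example a wrapper around a reward model.
    """
    prompts = build_prompts(countries)
    scores = {c: reward_fn(p) for c, p in prompts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def toy_reward(text: str) -> float:
    """Hypothetical stand-in scorer (NOT a reward model): shorter text scores higher."""
    return -float(len(text))


if __name__ == "__main__":
    for country, score in rank_by_reward(["Jordan", "Nigeria", "Germany"], toy_reward):
        print(f"{country}: {score}")
```

Passing the scorer in as a callable keeps the prompt construction and ranking separate from any particular reward model, so the same loop could be reused with a real model wrapper in place of `toy_reward`.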