Evaluating Large Language Model Biases in Persona-Steered Generation

Andy Liu; Mona Diab; Daniel Fried

TL;DR

This work investigates how large language models reflect multifaceted personas in open-ended generation, introducing incongruous versus congruous personas derived from Pew OpinionsQA data. The authors establish a persona-steered generation task, evaluate steerability with GPT-4 (validated against human annotations), and compare several models and fine-tuning methods (including RLHF and DPO). Key findings show that models are consistently more steerable toward congruous personas, that RLHF improves steerability but reduces semantic diversity, and that steerability in multiple-choice settings only weakly predicts open-ended performance. The study highlights potential social harms from biased representation, such as increased polarization and narrower viewpoints, and argues for open-ended evaluation and richer persona representations to surface and mitigate biases in LLM simulations.

Abstract

The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g., political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with the persona's demographic rather than the target stance. The models we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.
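
The congruous/incongruous distinction defined above can be illustrated with a minimal sketch. It assumes a persona is a (demographic, stance) pair and that "less likely" means the stance is rarer among survey respondents in the demographic than among respondents overall; that baseline is an assumption made here for illustration, and the paper's exact comparison may differ. The `Persona` class, the `classify` function, and the rates are hypothetical and are not taken from the paper or from Pew data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    demographic: str  # e.g. "political liberal"
    stance: str       # e.g. "supports increased military spending"


def classify(rate_in_demographic: float, rate_overall: float) -> str:
    """Label a persona by comparing two survey agreement rates.

    Assumption: a persona is incongruous when belonging to the
    demographic makes the stance less likely than it is overall.
    """
    if rate_in_demographic < rate_overall:
        return "incongruous"
    return "congruous"


# Made-up rates for illustration only; these are NOT Pew survey figures.
persona = Persona("political liberal", "supports increased military spending")
label = classify(rate_in_demographic=0.20, rate_overall=0.35)
print(persona, "->", label)
```

Under these made-up rates the example persona is labeled incongruous, matching the abstract's example of a political liberal who supports increased military spending.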

Paper Structure (28 sections, 4 figures, 8 tables)

Introduction
Methods
Persona-Steered Generation Setting
Steerability Evaluation
Additional Metrics
Results and Discussion
GPT-4 is a Strong Proxy for Human Evaluation [RQ4]
Steerability by Stance Type
Fine-Tuning Improves Steerability, but Stances Benefit Unequally [RQ2]
Steerability by Stance is Not Predictable from Model Survey Response Rates [RQ3]
Steerability Towards Congruous and Incongruous Personas [RQ1]
All Models are Worse at Representing Incongruous Personas
Steering Towards Incongruous Personas Reduces Diversity and Susceptibility to Caricature
Differences in Steerability Could Lead to Social Harms
Related Work
...and 13 more sections

Figures (4)

Figure 1: The process by which we construct personas from human data to evaluate LLM steerability. We find that LLMs are less steerable towards incongruous personas, defined as personas where identifying as the demographic of the persona causes a Pew survey respondent to be less likely to take its stance. When given an incongruous persona, models often default to the stereotypical stance associated with a demographic, despite being explicitly directed to take the opposite stance.
Figure 2: An example of how prompts are constructed for our persona-steered generation task. A persona consists of a demographic as well as a stance on an issue that is relevant to the demographic. We vary the order of elements within the persona to test sensitivity to prompt wording.
Figure 3: Mean steerability of Llama and Tulu models towards stances most commonly associated with each demographic, grouped by the method used to fine-tune each model. We report bootstrapped 95% confidence intervals in addition to the means. Models fine-tuned with RLHF and DPO are significantly more steerable towards all stances, especially those associated with women and political liberals.
Figure 4: An example of the human annotation interface we use to validate our choice of GPT-4 as an evaluator model. Annotators are prompted with a statement, as well as both the stance and opposing stance that a statement was generated from. They are then asked to select the stance that is more likely to make the statement.
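
The caption of Figure 3 reports mean steerability with bootstrapped 95% confidence intervals. The sketch below shows one common way such an interval can be computed, the percentile bootstrap. It assumes steerability is the fraction of generated statements that an evaluator judges to reflect the target stance, encoded as 1 (matches) or 0 (does not match). The function names, the resample count, and the judgment data are hypothetical; the paper's exact procedure is not given in this excerpt.

```python
import random
from statistics import mean


def steerability(judgments):
    """Fraction of statements judged to reflect the target stance (1 or 0)."""
    return mean(judgments)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up evaluator judgments: 70 statements match the stance, 30 do not.
judgments = [1] * 70 + [0] * 30
point = steerability(judgments)
lower, upper = bootstrap_ci(judgments)
print(f"steerability={point:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The percentile method is used here because it is simple; other bootstrap variants would work equally well for this illustration.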
