How Far Can We Extract Diverse Perspectives from Large Language Models?

Shirley Anugrah Hayati; Minhwa Lee; Dheeraj Rajagopal; Dongyeop Kang

TL;DR

The paper investigates how far large language models can generate diverse human perspectives on subjective topics, introducing a criteria-based prompting framework to ground responses in value-driven criteria and a step-by-step recall prompting method to measure diversity coverage. Through experiments across social norms, argumentation, hate speech labeling, and moral-story continuation, the authors show LLMs can reach diversity levels comparable to humans on more subjective tasks, with saturation depending on task subjectivity. They compare criteria-based prompts to free-form prompts and conduct human evaluations, finding that LLMs often align with human perspectives but may emphasize different criteria, and that combining multiple humans with LLM prompts can enhance diversity. The study provides a framework for generating diverse data for open-ended tasks, discusses the implications for fairness and bias, and suggests future work on real-world distributions, cultural factors, and interactive human-LLM collaboration for richer diversity.

Abstract

Collecting diverse human opinions is costly and challenging. This has led to a recent trend of using large language models (LLMs) to generate diverse data as a potentially scalable and efficient solution. However, the extent to which LLMs can generate diverse perspectives on subjective topics is still unclear. In this study, we explore LLMs' capacity for generating diverse perspectives and rationales on subjective topics such as social norms and argumentative texts. We introduce the problem of extracting maximum diversity from LLMs. Motivated by how humans form opinions based on values, we propose a criteria-based prompting technique to ground diverse opinions. To see how far we can extract diverse perspectives from LLMs, which we call diversity coverage, we employ step-by-step recall prompting to generate more outputs from the model iteratively. Our methods, applied to various tasks, show that LLMs can indeed produce diverse opinions according to the degree of task subjectivity. We also find that LLMs' performance in extracting maximum diversity is on par with that of humans.

Paper Structure (50 sections, 14 figures, 13 tables)

Introduction
Contributions
Criteria-based Diversity Prompting
Motivation
Step 1: Think of Your Criteria First before Making Opinions
Task Definition
Criteria-Based vs. Free-form
Human Evaluation on Model-Generated Opinions
Step 2: Step-by-Step Recall Prompting to Maximize Diversity Incrementally
Experimental Setups
Models and Prompting
Datasets
Social-Chem-101
Change My View (CMV)
Hate Speech
...and 35 more sections

Figures (14)

Figure 1: LLMs are trained on texts written by different people who may have distinct perspectives. Our study examines whether LLMs can do "reverse modeling" of humans' perspectives from the training data and how much diversity coverage LLMs can generate. (A check mark = "Agree" and a cross mark = "Disagree")
Figure 2: People can have different opinions given a subjective statement. Given a statement, humans can agree or disagree with the statement with their own criteria (e.g., teamwork, risk-taking) in deciding their stances.
Figure 3: Step-by-step recall prompting. The statement and first generated opinion become the demonstration for prompting the LLM to generate $N$ opinions. The blue-colored parts (Steps 1 and 2) are done incrementally with step size = 3.
Figure 4: Semantic diversity scores for different LLMs and prompting methods on the Social-Chem-101 (left) and CMV (right) datasets. Criteria-based prompting is the best diversity extraction method across LLM variants, datasets, and numbers of shots. We also found that too many examples may hurt diversity (5-shot results). The results on Social-Chem-101 are statistically significant with p < 0.05 (GPT-4) and p < 0.01 (the rest of the models); on CMV, p < 0.01 for GPT-3 and Mixtral.
Figure 5: X-axis = the number of generated opinions for our diversity coverage experiment. Y-axis = the average number of unique criteria clusters across all statements. Moral Stories has no stances, so its line covers all generated story continuations. The more subjective a task is, the more unique criteria clusters an LLM can generate.
...and 9 more figures
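
The incremental loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `build_prompt` and `step_by_step_recall` helpers, the `generate(prompt, n)` callable, and the stopping rule (stop when no new opinion comes back) are all hypothetical. Only the idea of feeding earlier opinions back in and asking for more in steps of 3 comes from the caption.

```python
from typing import Callable, List


def build_prompt(statement: str, previous: List[str], n_new: int) -> str:
    """Assemble a prompt that shows earlier opinions and asks for new ones.

    The wording here is an assumption made for illustration only.
    """
    lines = [f"Statement: {statement}"]
    if previous:
        lines.append("Opinions generated so far:")
        lines += [f"{i}. {op}" for i, op in enumerate(previous, 1)]
    lines.append(f"Generate {n_new} more opinions that differ from the ones above.")
    return "\n".join(lines)


def step_by_step_recall(
    statement: str,
    generate: Callable[[str, int], List[str]],  # hypothetical LLM wrapper
    max_opinions: int = 10,
    step_size: int = 3,
) -> List[str]:
    """Collect opinions incrementally, feeding earlier ones back as context."""
    # The first opinion acts as the demonstration for later requests.
    opinions = list(generate(build_prompt(statement, [], 1), 1)[:1])
    while len(opinions) < max_opinions:
        n_new = min(step_size, max_opinions - len(opinions))
        candidates = generate(build_prompt(statement, opinions, n_new), n_new)
        added = 0
        for op in candidates[:n_new]:
            if op not in opinions:  # skip exact repeats
                opinions.append(op)
                added += 1
        if added == 0:  # assumed stopping rule: nothing new was produced
            break
    return opinions


def make_fake_llm(pool: List[str]) -> Callable[[str, int], List[str]]:
    """Stand-in for a real model: hands out canned opinions in order."""
    it = iter(pool)

    def generate(prompt: str, n: int) -> List[str]:
        # range(n) comes first so zip never consumes an extra pool item.
        return [op for _, op in zip(range(n), it)]

    return generate


if __name__ == "__main__":
    pool = [f"opinion {i}" for i in range(1, 6)]
    for op in step_by_step_recall("It's okay to quit a team.", make_fake_llm(pool)):
        print(op)
```

In practice `generate` would wrap a real model call and parse its output into a list of opinions; the exact-match duplicate check is a placeholder for whatever criteria-level comparison is used to count unique perspectives.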
