Diverse, but Divisive: LLMs Can Exaggerate Gender Differences in Opinion Related to Harms of Misinformation

Terrence Neumann, Sooyong Lee, Maria De-Arteaga, Sina Fazelpour, Matthew Lease

TL;DR

This work examines the risks and dynamics of using LLMs to assist claim prioritization in fact-checking, focusing on gendered perceptions of misinformation harms. By building the TopicMisinfo dataset (160 claims with ~1592 human annotations) and evaluating GPT-3.5 Turbo under gender-conditioned and neutral prompts, the authors quantify how AI reflects and sometimes amplifies gender differences using bootstrap statistics such as $\hat{E}_{\omega}$ and $MSE_{\omega}^{gender}$. Findings show that while LLMs can mirror some human gender differences, they frequently exaggerate disagreements, and neutral prompts tend to align more with men on contentious topics like abortion. The study highlights procedural fairness concerns for fact-checking organizations, provides guidance for LLM developers, and offers a publicly released dataset to foster ongoing research in responsible AI-assisted fact-checking.
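As a rough illustration of the kind of bootstrap comparison mentioned above, the sketch below resamples two groups of ratings and reports the gap in mean perceived harm with a percentile interval. It is not the authors' definition of $\hat{E}_{\omega}$ or $MSE_{\omega}^{gender}$, which are not given in this summary. The function name `bootstrap_mean_gap`, the rating scale, and the example ratings are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def bootstrap_mean_gap(ratings_a, ratings_b, n_resamples=2000, seed=0):
    """Bootstrap the gap mean(ratings_a) - mean(ratings_b).

    Returns (point_estimate, (lower, upper)), where the bounds form a
    95% percentile interval over the resampled gaps.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        sample_a = rng.choices(ratings_a, k=len(ratings_a))
        sample_b = rng.choices(ratings_b, k=len(ratings_b))
        gaps.append(mean(sample_a) - mean(sample_b))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return mean(ratings_a) - mean(ratings_b), (lower, upper)


# Made-up harm ratings on a hypothetical scale; NOT data from TopicMisinfo.
women = [5, 6, 4, 5, 6, 5, 4, 6]
men = [3, 4, 4, 2, 5, 3, 4, 3]
gap, (lo, hi) = bootstrap_mean_gap(women, men)
print(f"mean gap = {gap:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

Running the same function on human ratings and on model ratings from gender-conditioned prompts would give two gaps to compare. A model gap well outside the human interval would be one sign of the exaggeration the summary describes.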

Abstract

The pervasive spread of misinformation and disinformation poses a significant threat to society. Professional fact-checkers play a key role in addressing this threat, but the vast scale of the problem forces them to prioritize their limited resources. This prioritization may consider a range of factors, such as varying risks of harm posed to specific groups of people. In this work, we investigate potential implications of using a large language model (LLM) to facilitate such prioritization. Because fact-checking impacts a wide range of diverse segments of society, it is important that diverse views are represented in the claim prioritization process. This paper examines whether an LLM can reflect the views of various groups when assessing the harms of misinformation, focusing on gender as a primary variable. We pose two central questions: (1) To what extent do prompts with explicit gender references reflect gender differences in opinion in the United States on topics of social relevance? and (2) To what extent do gender-neutral prompts align with gendered viewpoints on those topics? To analyze these questions, we present the TopicMisinfo dataset, containing 160 fact-checked claims from diverse topics, supplemented by nearly 1,600 human annotations with subjective perceptions and annotator demographics. Analyzing responses to gender-specific and neutral prompts, we find that GPT-3.5 Turbo reflects empirically observed gender differences in opinion but amplifies the extent of these differences. These findings illuminate AI's complex role in moderating online communication, with implications for fact-checkers, algorithm designers, and the use of crowd-workers as annotators. We also release the TopicMisinfo dataset to support continuing research in the community.

Paper Structure (23 sections, 12 equations, 1 figure, 7 tables)


Figures (1)

  • Figure 1: These charts, relevant to RQ1, illustrate variations in average responses when evaluating the perceived harm to specific demographic groups for claims across topics. Chart A compares the differences in opinions between human men and women with those generated by AI using gender-conditioned prompts on topics prone to divergent viewpoints. Here, we observe that prompts tend to capture and even exaggerate the range of opinions found in human responses. In contrast, Chart B reveals that AI models tend to forecast significant levels of disagreement on topics that typically do not cause such discordance in real-world scenarios, such as "Health and Science" and "Weather and Climate".