GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models

Leyan Wang; Yonggang Jin; Tianhao Shen; Tianyu Zheng; Xinrun Du; Chenchen Zhang; Wenhao Huang; Jiaheng Liu; Shi Wang; Ge Zhang; Liuyu Xiang; Zhaofeng He

TL;DR

GIEBench is a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities, designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race.

Abstract

As large language models (LLMs) continue to develop and gain widespread application, the ability of LLMs to exhibit empathy towards diverse group identities and understand their perspectives is increasingly recognized as critical. Most existing benchmarks for empathy evaluation of LLMs focus primarily on universal human emotions, such as sadness and pain, often overlooking the context of individuals' group identities. To address this gap, we introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. This supports the ongoing development of empathetic LLM applications tailored to users with different identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives. This highlights the need for improved alignment of LLMs with diverse values to better accommodate the multifaceted nature of human identities. Our datasets are available at https://github.com/GIEBench/GIEBench.

Paper Structure (26 sections, 3 figures, 8 tables)

Introduction
Related Work
Empathy Evaluation of LLMs
Value Pluralism of LLMs
GIEBench
Plural Controversial Topics Generation
Internet Sourcing
GPT-4 Based Synthetic Topic Generation
Human Annotation
Prompt Construction and Pipeline
Data Statistics
Results and Analysis
Evaluating Plurality in LLMs
Experiment Settings
Main Results
...and 11 more sections

Figures (3)

Figure 1: The proportion of the eleven identity dimensions in GIEBench. The categories of Gender and Occupation have the smallest proportions, accounting for 6.71% and 6%, respectively. The proportions of the remaining categories are all around 10%. A broad range of categories facilitates our evaluation of LLMs’ performance across various identity standpoints.
Figure 2: The process of constructing GIEBench. First, a collection of controversial topics, each corresponding to a specific identity, is developed using web resources, manual selection, and GPT-4. Next, we annotate attitude labels from the perspectives of these identities. We also use GPT-4 to generate four responses for each topic, ensuring that only one response aligns with the identity's stance. Finally, using the established identities, topics, and responses, we design three types of prompts that ask LLMs to select the most appropriate response. In the COT-Prompt, a Chain of Thought (COT) is provided along with identity information. In the ID-Prompt, only the identity is disclosed, while the Raw-Prompt includes no additional information.
Figure 3: Accuracy of six different series of Large Language Models (LLMs) on our dataset under the COT-Prompt. Overall, GPT-4-turbo performs better, which, to some extent, reflects its superior alignment across various identity positions.
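
The three prompt types described in the Figure 2 caption, and the accuracy metric reported in Figure 3, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `Item` fields, the function names, the prompt wording, and the chain-of-thought text are hypothetical and are not taken from the GIEBench repository. The sketch only mirrors what the caption states: the Raw-Prompt carries no identity information, the ID-Prompt discloses the identity, and the COT-Prompt adds a chain-of-thought instruction on top of the identity.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCD"  # four candidate responses per topic, per the Figure 2 caption


@dataclass
class Item:
    """One single-choice question (hypothetical schema, not the dataset's)."""
    identity: str        # group identity the topic is tied to
    topic: str           # controversial topic text
    options: List[str]   # four responses, exactly one aligned with the identity
    answer: int          # index of the identity-aligned response


def format_question(item: Item) -> str:
    """Render the topic and its lettered options (wording is illustrative)."""
    lines = [f"Topic: {item.topic}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def build_prompt(item: Item, mode: str) -> str:
    """Build a Raw-, ID-, or COT-style prompt; all phrasing is an assumption."""
    question = format_question(item)
    if mode == "raw":
        # Raw-Prompt: no identity information at all.
        return question
    identity_line = f"The user identifies as: {item.identity}."
    if mode == "id":
        # ID-Prompt: only the identity is disclosed.
        return f"{identity_line}\n{question}"
    if mode == "cot":
        # COT-Prompt: identity plus a chain-of-thought instruction.
        cot = ("Think step by step about how someone with this identity "
               "would view the topic, then choose the matching response.")
        return f"{identity_line}\n{cot}\n{question}"
    raise ValueError(f"unknown prompt mode: {mode}")


def accuracy(predictions: List[int], items: List[Item]) -> float:
    """Fraction of items where the predicted option index equals the label."""
    if not items:
        return 0.0
    correct = sum(p == it.answer for p, it in zip(predictions, items))
    return correct / len(items)


# Placeholder item; the strings are stand-ins, not benchmark data.
example = Item(
    identity="<identity>",
    topic="<topic text>",
    options=["<response 1>", "<response 2>", "<response 3>", "<response 4>"],
    answer=2,
)
```

Calling `build_prompt(example, mode)` for each of `"raw"`, `"id"`, and `"cot"` on the same item and comparing `accuracy` across the three modes is one way to see the gap the abstract describes between understanding an identity's standpoint and adopting it without explicit instruction.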
