Fair Abstractive Summarization of Diverse Perspectives

Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Fabbri, Junru Liu, Ryo Kamoi, Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, Kathleen McKeown, Rui Zhang

TL;DR

This paper formally defines fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and proposes four reference-free automatic metrics by measuring the differences between target and source perspectives.

Abstract

People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and we propose four reference-free automatic metrics by measuring the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.
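To make the definition concrete, the sketch below compares how often each group's perspective appears in the source with how often it appears in a summary, and reports the groups that end up underrepresented. This is an illustration only, not one of the paper's four metrics: the function names, the per-unit group labels, and the `tolerance` threshold are all assumptions introduced here. It also assumes that every source document and every summary unit has already been attributed to a group.

```python
from collections import Counter


def group_distribution(labels):
    """Return the share of each group among a list of group labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresentation(source_dist, summary_dist):
    """Per-group gap: how far the summary share falls below the source share.

    Groups that are matched or overrepresented get a gap of 0.0.
    """
    return {
        group: max(0.0, share - summary_dist.get(group, 0.0))
        for group, share in source_dist.items()
    }


def is_fair(source_dist, summary_dist, tolerance=0.05):
    """Hypothetical decision rule: no group is short by more than `tolerance`."""
    gaps = underrepresentation(source_dist, summary_dist)
    return all(gap <= tolerance for gap in gaps.values())


# Toy input: balanced reviews, but a summary dominated by positive content.
source_labels = ["positive"] * 5 + ["negative"] * 5
summary_labels = ["positive"] * 4 + ["negative"] * 1

source_dist = group_distribution(source_labels)
summary_dist = group_distribution(summary_labels)
gaps = underrepresentation(source_dist, summary_dist)
print(gaps)
print(is_fair(source_dist, summary_dist))
```

In this toy input the source is split evenly, while negative content makes up only a fifth of the summary, so the negative group shows a gap of roughly 0.3 and the summary fails the assumed threshold. Only shortfalls are counted, which mirrors the definition's focus on not underrepresenting any group.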

Paper Structure

This paper contains 64 sections, 4 equations, 10 figures, and 15 tables.

Figures (10)

  • Figure 1: An example from PerspectiveSumm. The blue/red box displays the input consisting of positive/negative reviews. The grey box shows the summary generated by GPT-3.5 (text-davinci-003). The generated summary is unfair because the negative reviews are underrepresented compared with the positive reviews.
  • Figure 2: Overview of our proposed metrics. Dist. means value distribution.
  • Figure 3: Relation between temperature and correlation scores on Claritin using gpt-turbo-3.5. The x-axis is the softmax temperature of BARTScore. The y-axis shows Krippendorff's alpha and the Pearson correlation coefficient with human evaluation. The Pearson correlation coefficient is higher than Krippendorff's alpha because the Pearson correlation coefficient only computes positive relations, while Krippendorff's alpha requires the annotations to be the same.
  • Figure 4: Distribution of Male and Female values in summaries generated by gpt-turbo-3.5 on Claritin.
  • Figure 5: Effect of decoding temperature.
  • ...and 5 more figures