Coverage-based Fairness in Multi-document Summarization

Haoyuan Li, Yusen Zhang, Rui Zhang, Snigdha Chaturvedi

TL;DR

The paper addresses fairness in multi-document summarization (MDS). It argues that Proportional Representation (PR), the existing summary-level measure, ignores redundancy in the input documents and overlooks corpus-level unfairness. It introduces two coverage-based measures, Equal Coverage (EC) for summary-level fairness and Coverage Parity (CP) for corpus-level fairness, and validates them through human judgments and LLM evaluations across five domains. Empirical results show that EC generally aligns with the paper's definition of fairness better than PR, while CP reveals corpus-wide biases that vary by model and domain; Claude3-sonnet often emerges as the fairest, though most models overrepresent certain social attribute values. The work gives researchers and practitioners metrics for assessing and mitigating unfairness in LLM-based MDS systems, and encourages dataset balancing and model tuning informed by EC and CP.

Abstract

Fairness in multi-document summarization (MDS) measures whether a system can generate a summary fairly representing information from documents with different social attribute values. Fairness in MDS is crucial since a fair summary can offer readers a comprehensive view. Previous works focus on quantifying summary-level fairness using Proportional Representation, a fairness measure based on Statistical Parity. However, Proportional Representation does not consider redundancy in input documents and overlooks corpus-level unfairness. In this work, we propose a new summary-level fairness measure, Equal Coverage, which is based on coverage of documents with different social attribute values and considers the redundancy within documents. To detect the corpus-level unfairness, we propose a new corpus-level measure, Coverage Parity. Our human evaluations show that our measures align more with our definition of fairness. Using our measures, we evaluate the fairness of thirteen different LLMs. We find that Claude3-sonnet is the fairest among all evaluated LLMs. We also find that almost all LLMs overrepresent different social attribute values. The code is available at https://github.com/leehaoyuan/coverage_fairness.

Paper Structure

This paper contains 34 sections, 7 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Issues with existing fairness measures for summary-level (a) and corpus-level (b) fairness.
  • Figure 2: Example prompt for splitting and rephrasing summary sentences (top) and summary sentences after splitting and rephrasing (bottom).
  • Figure 3: Interface for human evaluation.
  • Figure 4: Summarization prompt for the Amazon Dataset.
  • Figure 5: Summarization prompt for the News Stance Dataset.
  • ...and 2 more figures