AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference

Yang Han, Yiming Wang, Rui Wang, Lu Chen, Kai Yu

TL;DR

Experiments show that with AlignSum, PLMs like BART-Large surpass 175B GPT-3 in both automatic and human evaluations, demonstrating that AlignSum significantly enhances the alignment of language models with human summarization preferences.

Abstract

Text summarization tasks commonly employ Pre-trained Language Models (PLMs) to fit diverse standard datasets. While these PLMs excel in automatic evaluations, they frequently underperform in human evaluations, indicating a gap between their generated summaries and human summarization preferences. This discrepancy is likely due to the low quality of fine-tuning datasets and the limited availability of high-quality human-annotated data that reflect true human preference. To address this challenge, we introduce AlignSum, a novel framework for aligning with human summarization preference. The framework consists of three parts. First, we construct a Data Pyramid from extractive, abstractive, and human-annotated summary data. Second, we apply Gaussian Resampling to remove summaries with extreme lengths. Finally, we perform two-stage hierarchical fine-tuning on the Data Pyramid after Gaussian Resampling. We apply AlignSum to PLMs on the human-annotated CNN/DailyMail and BBC XSum datasets. Experiments show that with AlignSum, PLMs like BART-Large surpass the 175B GPT-3 in both automatic and human evaluations. This demonstrates that AlignSum significantly enhances the alignment of language models with human summarization preferences.
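
The abstract does not spell out how summaries with extreme lengths are removed. As a rough illustration of the idea only, the sketch below fits a Gaussian (mean and standard deviation) to summary lengths and keeps the summaries that fall within a fixed number of standard deviations of the mean. The function name `filter_by_length`, the whitespace tokenization, and the two-standard-deviation cutoff are all assumptions made for this example, not the authors' implementation.

```python
import statistics


def filter_by_length(summaries, num_std=2.0):
    """Keep summaries whose length is within mean +/- num_std * std.

    Hypothetical helper: lengths are counted in whitespace-separated
    tokens, and num_std is an assumed cutoff, not a value from the paper.
    """
    if not summaries:
        return []
    lengths = [len(s.split()) for s in summaries]
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)  # population standard deviation
    low, high = mean - num_std * std, mean + num_std * std
    # Drop summaries whose length falls in either tail of the fitted Gaussian.
    return [s for s, n in zip(summaries, lengths) if low <= n <= high]


if __name__ == "__main__":
    typical = [" ".join(["word"] * 10) for _ in range(10)]
    outlier = " ".join(["word"] * 100)
    kept = filter_by_length(typical + [outlier])
    print(len(kept))
```

In this toy run the single 100-token summary lies more than two standard deviations above the mean and is dropped, while the ten 10-token summaries are kept.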

Paper Structure (48 sections, 5 equations, 5 figures, 18 tables)

Figures (5)

  • Figure 1: Results (scaled to 0-1) of the automatic score ROUGE (Lin, 2004) and the human rating G-Eval (Liu et al., 2023) on the standard dataset CNN/DailyMail. PLMs outperform LLMs on automatic scores but trail them on human ratings.
  • Figure 2: The overall pipeline of our summarization preference alignment framework AlignSum.
  • Figure 3: Summary token length distributions of DP.
  • Figure 4: Reference-based human evaluation of BART (w/ full DP) and GPT-3 (w/ CoT) compared to the golden reference on CNN/DailyMail and BBC XSum.
  • Figure 5: ROUGE-1/L of fine-tuning BART-Large with ED, AD, HD on CNN/DailyMail and BBC XSum.