AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference

Yang Han; Yiming Wang; Rui Wang; Lu Chen; Kai Yu

TL;DR

Experiments show that with AlignSum, PLMs like BART-Large surpass 175B GPT-3 in both automatic and human evaluations, demonstrating that AlignSum significantly enhances the alignment of language models with human summarization preferences.

Abstract

Text summarization tasks commonly employ Pre-trained Language Models (PLMs) to fit diverse standard datasets. While these PLMs excel in automatic evaluations, they frequently underperform in human evaluations, indicating a deviation between their generated summaries and human summarization preferences. This discrepancy is likely due to the low quality of fine-tuning datasets and the limited availability of high-quality human-annotated data that reflect true human preference. To address this challenge, we introduce AlignSum, a novel framework for aligning models with human summarization preference. The framework consists of three parts. First, we construct a Data Pyramid with extractive, abstractive, and human-annotated summary data. Second, we apply Gaussian Resampling to remove summaries with extreme lengths. Finally, we perform two-stage hierarchical fine-tuning on the Data Pyramid after Gaussian Resampling. We apply AlignSum to PLMs on the human-annotated CNN/DailyMail and BBC XSum datasets. Experiments show that with AlignSum, PLMs like BART-Large surpass 175B GPT-3 in both automatic and human evaluations. This demonstrates that AlignSum significantly enhances the alignment of language models with human summarization preferences.
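The length-filtering idea behind Gaussian Resampling can be illustrated with a minimal sketch. The abstract only states that summaries with extreme lengths are removed, so everything else here is an assumption made for illustration: the function name `gaussian_length_filter`, the whitespace tokenizer, and the cutoff of `k` standard deviations around the mean length are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def gaussian_length_filter(summaries, k=2.0, tokenize=str.split):
    """Keep summaries whose token length lies within mean +/- k * std.

    Assumption: lengths are treated as roughly Gaussian and anything
    outside k standard deviations counts as an "extreme" length.
    The paper's exact resampling rule may differ.
    """
    lengths = [len(tokenize(s)) for s in summaries]
    if len(lengths) < 2:
        # A standard deviation needs at least two samples; keep everything.
        return list(summaries)
    mu = mean(lengths)
    sigma = stdev(lengths)
    low, high = mu - k * sigma, mu + k * sigma
    return [s for s, n in zip(summaries, lengths) if low <= n <= high]
```

Applied to nine three-token summaries plus one 100-token outlier, a call such as `gaussian_length_filter(summaries, k=2.0)` would drop only the outlier. A subword tokenizer could be passed through `tokenize` in place of whitespace splitting.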

Paper Structure (48 sections, 5 equations, 5 figures, 18 tables)

This paper contains 48 sections, 5 equations, 5 figures, 18 tables.

Introduction
AlignSum: Summarization Preference Alignment Framework
Data Pyramid Construction
Extractive Data.
Abstractive Data.
Human-annotated Data.
Gaussian Resampling
Two-stage Hierarchical Fine-tuning
Why Hierarchical Fine-tuning?
Experiments
Setup
Dataset.
Data Statistics.
Baselines.
Implementation.
...and 33 more sections

Figures (5)

Figure 1: Results (scaled to 0-1) of the automatic score ROUGE (Lin, 2004) and the human rating GEval (Liu et al., 2023) on the standard dataset CNN/DailyMail. PLMs perform better than LLMs on automatic scores but worse on human ratings.
Figure 2: The overall pipeline of our summarization preference alignment framework AlignSum.
Figure 3: Summary token length distributions of DP.
Figure 4: Reference-based human evaluation of BART (w/ full DP) and GPT-3 (w/CoT) compared to the golden reference on CNN/DailyMail and BBC XSum.
Figure 5: ROUGE-1/L of fine-tuning BART-Large with ED, AD, HD on CNN/DailyMail and BBC XSum.
