Learning to Summarize from LLM-generated Feedback

Hwanjun Song, Taewon Yun, Yuho Lee, Jihwan Oh, Gihun Lee, Jason Cai, Hang Su

TL;DR

This work investigates how to improve text summarization by learning from LLM-generated feedback. It introduces FeedSum, a large-scale dataset of document–summary pairs annotated with multi-dimensional feedback across faithfulness, completeness, and conciseness, and analyzes how feedback quality, dimensionality, and granularity affect preference learning. Through systematic experiments, the authors compare supervised fine-tuning (SFT) and direct preference optimization (DPO), finding that DPO with high-quality, multi-dimensional, fine-grained feedback yields substantial gains, even enabling a smaller model (SummLlama3-8B) to outperform a much larger baseline. The study also examines feedback size, human versus synthetic feedback, and alternative optimization methods, providing practical recommendations for scalable alignment of summarization models. The released FeedSum and SummLlama3-8B enable broader adoption of preference-based alignment in summarization tasks.

Abstract

Developing effective text summarizers remains a challenge due to issues like hallucinations, key information omissions, and verbosity in LLM-generated summaries. This work explores using LLM-generated feedback to improve summary quality by aligning the summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity influence preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8B, a model that outperforms the nearly 10x larger Llama3-70b-instruct in generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset and SummLlama3-8B model are available at https://huggingface.co/datasets/DISLab/FeedSum and https://huggingface.co/DISLab/SummLlama3-8B.

Paper Structure

This paper contains 58 sections, 8 equations, 2 figures, 23 tables.

Figures (2)

  • Figure 1: FeedSum consists of summaries of varying quality, generated by 13 different summarizers across input documents from 7 distinct domains. Through automated evaluation using LLMs, 125K document-summary pairs have been produced, each accompanied by LLM-generated multi-dimensional feedback, providing valuable data for preference learning.
  • Figure 2: Distribution of summary scores on a 1–5 Likert scale across the four different configurations. Percentage scores in C4 are converted into Likert-scale ones through uniform quantization for ease of interpretation.
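
The caption of Figure 2 mentions converting percentage scores to a 1–5 Likert scale through uniform quantization. The caption does not give the bin edges, so the following is only a minimal sketch under the assumption that scores lie in [0, 100] and are split into five equal-width bins; the function name `percent_to_likert` and the handling of the top edge are illustrative choices, not taken from the paper.

```python
def percent_to_likert(score: float, levels: int = 5) -> int:
    """Map a percentage score in [0, 100] to an integer rating in 1..levels.

    Assumption (not from the paper): equal-width bins, with each bin
    closed on the left, and a score of exactly 100 placed in the top bin.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score must be in [0, 100], got {score}")
    bin_width = 100.0 / levels
    # Floor division picks the zero-based bin; clamp so 100 stays in the last bin.
    return min(int(score // bin_width) + 1, levels)


if __name__ == "__main__":
    for s in (0.0, 19.9, 20.0, 50.0, 99.0, 100.0):
        print(s, percent_to_likert(s))
```

With five levels, each bin spans 20 percentage points, so a score of 50 falls in the third bin. A different convention (for example, bins closed on the right) would shift scores that sit exactly on a boundary.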