Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning

Sicong Huang, Qianqi Yan, Shengze Wang, Ian Lane

TL;DR

This work tackles hallucinations in abstractive summarization by creating a dataset of span-level labeled summaries annotated with GPT-4o and training LLMs with span-aware fine-tuning. It evaluates three methods—gradient ascent, unlikelihood training, and task vector negation—using both faithful and unfaithful spans weighted by a factor $\epsilon$, finding that unlikelihood training most reliably improves faithfulness across CNNDM, SAMSum, and XSum. The study highlights the practical value of span-level annotations for reducing unfaithful content and provides guidance on method robustness and hyperparameter sensitivity for real-world deployment. Limitations include annotation reliability and the need to compare with recent alignment methods, pointing to avenues for future work.

Abstract

Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. However, despite their ability to generate fluent summaries, these models sometimes produce unfaithful summaries, introducing hallucinations at the word, phrase, or concept level. Existing mitigation strategies, such as post-processing corrections or contrastive learning with synthetically generated negative samples, fail to fully address the diverse errors that can occur in LLM-generated summaries. In this paper, we investigate fine-tuning strategies to reduce the occurrence of unfaithful spans in generated summaries. First, we automatically generate summaries for the source documents in the training set with a variety of LLMs and then use GPT-4o to annotate any hallucinations it detects at the span level. Leveraging these annotations, we fine-tune LLMs with both hallucination-free summaries and annotated unfaithful spans to enhance model faithfulness. We introduce a new dataset that contains both faithful and unfaithful summaries with span-level labels, and we evaluate three techniques for fine-tuning an LLM to improve the faithfulness of its summaries: gradient ascent, unlikelihood training, and task vector negation. Experimental results show that all three approaches successfully leverage span-level annotations to improve faithfulness, with unlikelihood training being the most effective.

Paper Structure

This paper contains 30 sections, 3 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 2: Training data construction: summaries of the source documents used for model training are generated using an LLM. Spans of text in the generated summaries that are unfaithful to the source document are automatically labeled using GPT-4o (using the prompt in Appendix A.3). Summaries that have no unfaithful spans labeled in their output are treated as positive training samples, and summaries that contain unfaithful spans are treated as negative training samples.
  • Figure 3: Model update: a base model is updated using both the faithful positive example summaries and the unfaithful negative example summaries with hallucination spans, using one of the three approaches we compare in this paper: (1) Gradient Ascent, (2) Unlikelihood Training, or (3) Task Vector Negation.
  • Figure 4: Average G-Eval on all three datasets with models trained using different $\epsilon$.
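
To make the model update in Figure 3 more concrete, the following is a minimal sketch of how span-level labels could feed an unlikelihood-style objective. It is not the authors' implementation: the function names, the character-offset span format, and the exact way $\epsilon$ enters the loss are assumptions made for illustration. Tokens outside labeled spans receive the usual negative log-likelihood, while tokens inside unfaithful spans receive an $\epsilon$-weighted penalty that grows as the model assigns them higher probability.

```python
import math

def span_token_mask(token_offsets, unfaithful_spans):
    """Flag each token that overlaps any unfaithful character span.

    token_offsets: list of (start, end) character offsets, end exclusive.
    unfaithful_spans: list of (start, end) character spans, end exclusive.
    Both formats are hypothetical; the paper's annotation format may differ.
    """
    return [
        any(start < s_end and end > s_start for s_start, s_end in unfaithful_spans)
        for start, end in token_offsets
    ]

def span_unlikelihood_loss(token_probs, unfaithful_mask, epsilon=0.5):
    """Average per-token loss over one summary.

    Faithful tokens:   -log p          (standard likelihood term)
    Unfaithful tokens: -epsilon * log(1 - p)  (unlikelihood term)
    How epsilon is applied here is an assumption, not taken from the paper.
    """
    if not token_probs:
        return 0.0
    tiny = 1e-12  # clamp probabilities to keep the logarithms finite
    total = 0.0
    for p, is_unfaithful in zip(token_probs, unfaithful_mask):
        p = min(max(p, tiny), 1.0 - tiny)
        if is_unfaithful:
            total += -epsilon * math.log(1.0 - p)
        else:
            total += -math.log(p)
    return total / len(token_probs)

# Toy usage: three tokens, the middle one lies inside an unfaithful span.
mask = span_token_mask([(0, 3), (4, 9), (10, 15)], [(4, 9)])
loss = span_unlikelihood_loss([0.9, 0.8, 0.7], mask, epsilon=0.5)
```

In this sketch, a summary with no labeled spans reduces to ordinary fine-tuning on a positive sample, and setting `epsilon` to zero ignores the unfaithful tokens entirely.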