Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization

Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee

TL;DR

This work investigates the Hallucinate at the Last phenomenon, where faithfulness deteriorates in the final portion of long outputs generated from extended contexts. Through a broad empirical study across multiple LLMs, domains, and long-output prompts, the authors establish a consistent end-of-output decline in faithfulness and introduce a sensitivity metric to quantify it. They analyze potential causes in summarization structure and attention dynamics and evaluate mitigation strategies, finding that sliding-window attention and chunk-based generation can partially alleviate end-of-output hallucinations but do not fully resolve the faithfulness problem in long-context generation. The findings highlight the importance of accounting for position within the output when designing robust long-form generation systems and point future research toward position-aware decoding and context integration techniques.

Abstract

Large Language Models (LLMs) have significantly advanced text generation capabilities, including tasks like summarization, often producing coherent and fluent outputs. However, faithfulness to source material remains a significant challenge due to the generation of hallucinations. While extensive research focuses on detecting and reducing these inaccuracies, less attention has been paid to the positional distribution of hallucination within generated text, particularly in long outputs. In this work, we investigate where hallucinations occur in LLM-based long response generation, using long document summarization as a key case study. Focusing on the challenging setting of long context-aware long response generation, we find a consistent and concerning phenomenon: hallucinations tend to concentrate disproportionately in the latter parts of the generated long response. To understand this bias, we explore potential contributing factors related to the dynamics of attention and decoding over long sequences. Furthermore, we investigate methods to mitigate this positional hallucination, aiming to improve faithfulness specifically in the concluding segments of long outputs.

Paper Structure

This paper contains 44 sections, 7 equations, 8 figures, 22 tables.

Figures (8)

  • Figure 1: Comparison of faithfulness scores for summaries of approximately 50, 200, and 800 words, generated by GPT-4o mini from a 6K-token Wikipedia context.
  • Figure 2: Comparison of faithfulness scores for standard summaries generated by different models across increasing output lengths on the Wikipedia dataset, with context lengths ranging from 1K to 8K.
  • Figure 3: Comparison of faithfulness scores for long summaries generated by different models across increasing output lengths on the Wikipedia dataset, with context lengths ranging from 1K to 8K.
  • Figure 4: Comparison of faithfulness scores in long summaries generated by the Llama3.1-8B-Instruct model across increasing output lengths in multiple domains, with context lengths ranging from 1K to 8K.
  • Figure 5: Faithfulness scores of reference summaries and those generated by Llama on the CNNDM dataset.
  • ...and 3 more figures