How Does Response Length Affect Long-Form Factuality

James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng

TL;DR

The study investigates how response length affects factuality in long-form LLM outputs and introduces Bafe, a Bi-Level Atomic Fact Evaluation framework that decomposes responses into atomic facts and verifies them first against Wikipedia and then, as a fallback, with a single Google Search. Through controlled experiments on biography generation and long fact descriptions, the authors demonstrate a clear length bias: factual precision declines as output length increases. They evaluate three hypotheses (error propagation, long context, and facts exhaustion) and find that facts exhaustion is the primary driver of degradation, while error propagation has only a weak effect and long context is not a major factor. These findings advance our understanding of length-related factuality challenges and suggest that improving fact retrieval strategies and multi-topic coverage could mitigate degradation in long-form generation.
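The two-level check described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `bilevel_verify`, `factual_precision`, `wiki_verifier`, `search_verifier`, and `revise` are hypothetical; the verifiers are plain callables standing in for the Wikipedia and Google Search checks. Factual precision is assumed here to be the fraction of atomic facts judged supported.

```python
from typing import Callable, List, Optional

# Hypothetical verifier type: takes an atomic fact, returns True if supported.
Verifier = Callable[[str], bool]


def bilevel_verify(
    facts: List[str],
    wiki_verifier: Verifier,
    search_verifier: Verifier,
    revise: Optional[Callable[[str], str]] = None,
) -> List[bool]:
    """Label each atomic fact as supported (True) or unsupported (False).

    Level 1 checks every fact with `wiki_verifier`. Only facts that fail
    level 1 reach level 2, where `search_verifier` is called exactly once.
    `revise` is an optional hook (assumed) for rewriting a fact to be
    self-contained before the level-2 check.
    """
    labels = []
    for fact in facts:
        if wiki_verifier(fact):
            labels.append(True)
            continue
        query = revise(fact) if revise is not None else fact
        labels.append(search_verifier(query))
    return labels


def factual_precision(labels: List[bool]) -> float:
    """Fraction of atomic facts labelled as supported (assumed definition)."""
    if not labels:
        raise ValueError("no atomic facts to score")
    return sum(labels) / len(labels)
```

With toy verifiers backed by sets of known facts, calling `bilevel_verify` and then `factual_precision` on the resulting labels gives a per-response score. Escalating only the facts that fail the first level keeps the number of second-level calls small.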

Abstract

Large language models (LLMs) are widely used for long-form text generation. However, factual errors in their responses undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic, bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts its more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.

Paper Structure

This paper contains 50 sections, 1 equation, 10 figures, 18 tables.

Figures (10)

  • Figure 1: Factual precision gradually decreases as response length increases, which demonstrates the existence of length bias in long-form factuality. Response length is measured by word count (split by spaces).
  • Figure 2: The pipeline of Bafe, an automatic bi-level long-form factuality evaluator. It first decomposes long responses into atomic facts. In first-level verification, each atomic fact is compared with a retrieved Wikipedia page. Atomic facts that are not supported at the first level proceed to second-level verification, where they are revised to be self-contained and checked against Google Search results. Human evaluation (see the Bafe overview section) shows superior performance of Bafe at low cost.
  • Figure 3: Factual precision decreases as response length increases in the long fact description task. Response length is measured by word count, using space as the delimiter.
  • Figure 4: Autocorrelation coefficient at different lags. Results are aggregated over all responses to compute the average autocorrelation coefficient for each lag. The 95% confidence intervals are obtained from 2,000 bootstrap resamples. Only the coefficient at lag 1 is slightly, but statistically significantly, higher than 0.
  • Figure 5: Factual precision in the evaluation section across varying context lengths and topics. The evaluation section is set to the "Career" topic with a fixed length. Results are obtained over three runs. As the context length increases, factual precision does not decline significantly.
  • ...and 5 more figures
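
The autocorrelation analysis in the Figure 4 caption can be sketched as below. This is an illustration under stated assumptions, not the paper's code. Each response is assumed to be a sequence of 0/1 labels (1 for a supported atomic fact, in generation order). The estimator is the standard sample autocorrelation, and the confidence interval is a percentile bootstrap over responses. The function names and the choice to skip responses with constant labels are hypothetical.

```python
import random
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def autocorr(seq: Sequence[float], lag: int) -> Optional[float]:
    """Sample autocorrelation of one label sequence at a positive lag.

    Returns None when the value is undefined: the sequence is too short
    for the lag, or all labels are identical (zero variance).
    """
    n = len(seq)
    if lag <= 0 or lag >= n:
        return None
    mu = mean(seq)
    denom = sum((x - mu) ** 2 for x in seq)
    if denom == 0:
        return None
    num = sum((seq[t] - mu) * (seq[t + lag] - mu) for t in range(n - lag))
    return num / denom


def per_response_autocorr(responses: List[Sequence[float]], lag: int) -> List[float]:
    """Autocorrelation at `lag` for every response where it is defined."""
    values = (autocorr(r, lag) for r in responses)
    return [v for v in values if v is not None]


def bootstrap_ci(
    responses: List[Sequence[float]],
    lag: int,
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean autocorrelation at `lag`."""
    values = per_response_autocorr(responses, lag)
    if not values:
        raise ValueError("autocorrelation undefined for every response")
    rng = random.Random(seed)
    # Resample responses with replacement and recompute the mean each time.
    stats = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

An interval that excludes 0 at a given lag would indicate that the correctness of a fact is correlated with the correctness of the fact `lag` positions earlier, which is the kind of evidence the error propagation hypothesis predicts.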