How Does Response Length Affect Long-Form Factuality
James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng
TL;DR
The study investigates how response length affects factuality in long-form LLM outputs and introduces Bafe, a Bi-Level Atomic Fact Evaluation framework that decomposes responses into atomic facts and verifies them first against Wikipedia and then, as a fallback, with a single Google Search. In controlled experiments on biography generation and long fact descriptions, the authors demonstrate a clear length bias: factual precision declines as output length increases. They evaluate three hypotheses (error propagation, long context, and facts exhaustion) and find that facts exhaustion is the primary driver of the degradation, while error propagation has only a weak effect and long context is not a major factor. These findings clarify why factuality degrades with length and suggest that better fact retrieval strategies and multi-topic coverage could mitigate the degradation in long-form generation.
Abstract
Large language models (LLMs) are widely used for long-form text generation. However, factual errors in their responses undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic, bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts its more reliable knowledge, is the primary cause of factual degradation, rather than error propagation or long context.
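The bi-level evaluation described in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `bi_level_verify` and `factual_precision`, the `Verifier` type, and the toy lookup tables standing in for Wikipedia and Google Search are all hypothetical. It also assumes that atomic facts have already been extracted, that the fallback is consulted only when the first level cannot decide, and that a fact neither level can decide counts as unsupported.

```python
from typing import Callable, Iterable, Optional

# Hypothetical verifier type: returns True (supported), False (not supported),
# or None (this source cannot decide).
Verifier = Callable[[str], Optional[bool]]


def bi_level_verify(fact: str, primary: Verifier, fallback: Verifier) -> bool:
    """Check a fact against the primary source, then the fallback if undecided."""
    verdict = primary(fact)
    if verdict is None:
        # Second level: consulted only when the first level is undecided (assumption).
        verdict = fallback(fact)
    # Assumption: a fact that neither level can decide counts as unsupported.
    return bool(verdict)


def factual_precision(facts: Iterable[str], primary: Verifier, fallback: Verifier) -> float:
    """Fraction of atomic facts judged supported; 0.0 for an empty response."""
    facts = list(facts)
    if not facts:
        return 0.0
    supported = sum(bi_level_verify(f, primary, fallback) for f in facts)
    return supported / len(facts)


# Toy lookup tables standing in for the two evidence sources (illustrative only).
wiki = {
    "Marie Curie won two Nobel Prizes.": True,
    "Marie Curie was born in Paris.": False,
}
web = {"Marie Curie had a daughter named Irene.": True}

facts = [
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in Paris.",
    "Marie Curie had a daughter named Irene.",
    "Marie Curie owned a cat.",
]

# dict.get returns None for unknown facts, which triggers the fallback.
precision = factual_precision(facts, wiki.get, web.get)
print(precision)
```

Passing the two verifiers as callables keeps the fallback logic separate from any particular retrieval backend, so the same precision computation can be reused when comparing responses of different lengths.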
