Know When To Stop: A Study of Semantic Drift in Text Generation

Ava Spataru, Eric Hambro, Elena Voita, Nicola Cancedda

TL;DR

The paper reveals that modern LLMs tend to generate correct facts early in long-form text and drift into falsehoods as generation continues. It formalizes semantic drift with the SD score and evaluates factuality using the FActScore framework on Wikipedia-style bios, finding high drift across several models. Practical mitigations, including early stopping and a resample-then-rerank pipeline, substantially improve factual accuracy while balancing information quantity and compute; attempting to repair drift via external QA APIs shows limited benefit. The methods generalize to long-form factual generation beyond bios, enabling more reliable content with manageable costs, and lay groundwork for further drift detection and mitigation research.

Abstract

In this work, we explicitly show that modern LLMs tend to generate correct facts first, then "drift away" and generate incorrect facts later: this was occasionally observed but never properly measured. We develop a semantic drift score that measures the degree of separation between correct and incorrect facts in generated texts and confirm our hypothesis when generating Wikipedia-style biographies. This correct-then-incorrect generation pattern suggests that factual accuracy can be improved by knowing when to stop generation. Therefore, we explore the trade-off between information quantity and factual accuracy for several early stopping methods and manage to improve factuality by a large margin. We also show that reranking with semantic similarity can further improve these results, both compared to the baseline and when combined with early stopping. Finally, we try calling an external API to bring the model back to the right generation path, but do not get positive results. Overall, our methods generalize and can be applied to any long-form text generation to produce more reliable information, by balancing trade-offs between factual accuracy, information quantity and computational cost.

Paper Structure (74 sections, 1 equation, 10 figures, 7 tables)

Figures (10)

  • Figure 1: A visual example of calculating semantic drift (SD) score for paragraph $P$. The position which best splits the paragraph is $k=8$. The proportion of supported facts to the left is 0.88 and the proportion of not-supported facts to the right is 0.78, giving an average of 0.83. The other positions all have lower SD scores, therefore the SD score of paragraph $P$ is 0.83.
  • Figure 2: Distribution of Semantic Drift Score (after filtering) in paragraphs generated by LLaMa2-70B (sampling: temperature=0.6, top-p=0.9).
  • Figure 3: Semantic drift score density plot for person popularity classes. LLaMa2-70B.
  • Figure 4: Trade-off between informativeness (y-axis) and factuality (x-axis) for proposed generation strategies; average over 500 biographical paragraphs.
  • Figure 5: Examples of biographies that were most improved by adding QA calls. Each row represents a biography with two generated versions (one without QA calls and one with). Green: correct facts, red: incorrect facts, blue: API calls.
  • ...and 5 more figures
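
The SD score calculation described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `semantic_drift_score`, the convention that `k` counts the facts to the left of the split, the range of candidate splits (1 to n-1), and the `example_labels` sequence are all assumptions. The example labels are hypothetical and were constructed only so that the proportions match those quoted in the caption (7 of 8 supported on the left, 7 of 9 not supported on the right).

```python
from typing import List, Tuple


def semantic_drift_score(labels: List[bool]) -> Tuple[float, int]:
    """Return (SD score, best split position k) for one paragraph.

    labels[i] is True when the i-th fact is supported, False otherwise.
    Assumption: k is the number of facts placed to the left of the split,
    and every split with a non-empty left and right side is a candidate.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two facts to split a paragraph")

    best_score, best_k = 0.0, 0
    for k in range(1, n):
        left, right = labels[:k], labels[k:]
        # Proportion of supported facts before the split.
        supported_left = sum(left) / len(left)
        # Proportion of not-supported facts after the split.
        unsupported_right = sum(not x for x in right) / len(right)
        score = (supported_left + unsupported_right) / 2
        if score > best_score:
            best_score, best_k = score, k
    return best_score, best_k


# Hypothetical 17-fact paragraph (T = supported, F = not supported).
T, F = True, False
example_labels = [T, T, T, F, T, T, T, T, F, F, T, F, F, F, T, F, F]

score, k = semantic_drift_score(example_labels)
print(round(score, 2), k)  # 0.83 8
```

Under these assumptions a paragraph whose facts are all supported scores 0.5 at every split, while a paragraph that is fully correct up to some point and fully incorrect afterwards scores 1.0, which matches the reading of the score as a measure of separation between correct and incorrect facts.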