Fine-Grained Self-Endorsement Improves Factuality and Reasoning

Ante Wang, Linfeng Song, Baolin Peng, Ye Tian, Lifeng Jin, Haitao Mi, Jinsong Su, Dong Yu

TL;DR

This paper tackles fact-conflicting hallucinations in large language models by introducing self-endorsement, a prompting-based inference-time framework that performs fine-grained fact-level cross-response verification across multiple samples. By decomposing each candidate into facts and computing endorsement scores via cross-candidate checks (and optional context pruning), it either selects the best candidate or regenerates a final answer conditioned on high-quality facts. Empirical results on Biographies, TriviaQA, and GSM8K show notable factuality gains for open-source and smaller LLMs, with endorsement scores correlating positively with factuality and improvements persisting under various hyperparameters. The approach offers a practical, scalable solution for reducing hallucinations in real-world settings and has potential for broader application beyond the tested domains.

Abstract

This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations. In particular, we propose a self-endorsement framework that leverages fine-grained, fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (Wang et al., 2022; Chen et al., 2023) that perform response-level selection, our approach can better alleviate hallucinations, especially for long-form generation tasks. Our approach can broadly benefit smaller and open-source LLMs, as it mainly conducts simple content-based comparisons. Experiments on Biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of LLMs. In addition, comprehensive analyses on TriviaQA and GSM8K demonstrate the potential of self-endorsement for broader application.
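The fact-level comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sampled response has already been decomposed into a list of fact strings, and it replaces the LLM-based verification prompt with a hypothetical `exact_match` stand-in. The function names, the scoring rule (fraction of other candidates that support a fact), and the `threshold` parameter are all assumptions made for illustration.

```python
from typing import Callable, List

# A candidate is represented as a list of already-decomposed fact strings (assumption).
Candidate = List[str]
# A verifier answers: does the other candidate support this fact?
Verifier = Callable[[str, Candidate], bool]


def exact_match(fact: str, other: Candidate) -> bool:
    """Hypothetical stand-in for an LLM verification prompt."""
    return fact in other


def endorsement_scores(candidates: List[Candidate], supports: Verifier) -> List[List[float]]:
    """Score each fact by the fraction of *other* candidates that support it."""
    scores = []
    for i, facts in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        row = []
        for fact in facts:
            votes = sum(1 for other in others if supports(fact, other))
            row.append(votes / len(others) if others else 0.0)
        scores.append(row)
    return scores


def select_best(scores: List[List[float]]) -> int:
    """Selection variant: index of the candidate with the highest mean fact score."""
    means = [sum(row) / len(row) if row else 0.0 for row in scores]
    return max(range(len(means)), key=means.__getitem__)


def high_quality_facts(candidates: List[Candidate], scores: List[List[float]],
                       threshold: float) -> List[str]:
    """Regeneration variant: collect de-duplicated facts scoring at or above threshold."""
    kept: List[str] = []
    for facts, row in zip(candidates, scores):
        for fact, score in zip(facts, row):
            if score >= threshold and fact not in kept:
                kept.append(fact)
    return kept
```

In a full pipeline, the facts returned by `high_quality_facts` would be placed in the prompt for a final regeneration step, while `select_best` corresponds to simply returning one of the sampled candidates.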

Paper Structure (31 sections, 1 equation, 9 figures, 4 tables)

This paper contains 31 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: The example framework of self-endorsement, where only two sampled candidates are leveraged.
  • Figure 2: Two main baselines in this work.
  • Figure 3: Statistical correlation between endorsement scores and factuality scores.
  • Figure 4: Hyperparameter analyses on LLaMA-7B-Chat (top) and LLaMA-70B-Chat (bottom). We present different choices of $\alpha$, $N$, and $M$ and their effects on Fact Acc. and #Fact.
  • Figure 5: Step 1 - candidate sampling. We only display 3 candidate samples here and the input prompt is highlighted in blue.
  • ...and 4 more figures