Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng

TL;DR

This work introduces Self-Alignment for Factuality, a framework in which an LLM's own self-evaluation (Self-Eval) supplies the factuality signal for training. SK-Tuning strengthens that self-evaluation by improving confidence estimation and calibration, and the model is then fine-tuned with Direct Preference Optimization (DPO). The pipeline generates candidate responses, scores their factuality using only the model's internal knowledge, and trains on the resulting self-annotated preference data, which reduces hallucinations on knowledge-intensive tasks. On the TruthfulQA and BioGEN benchmarks, the method yields substantial gains in factual accuracy for Llama-family models and outperforms representation-editing and consistency-based baselines. The results suggest that letting LLMs self-assess and refine how they convey knowledge is valuable for deploying factually reliable systems in high-stakes domains, and that the approach could be combined with decoding-based strategies and larger models.

Abstract

Despite showing increasingly human-like abilities, large language models (LLMs) often struggle with factual inaccuracies, i.e., "hallucinations", even when they hold relevant knowledge. To address these hallucinations, current approaches typically require high-quality human factuality annotations. In this work, we explore Self-Alignment for Factuality, where we leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. Specifically, we incorporate Self-Eval, a self-evaluation component, to prompt an LLM to validate the factuality of its own generated responses based solely on its internal knowledge. Additionally, we design Self-Knowledge Tuning (SK-Tuning) to augment the LLM's self-evaluation ability by improving the model's confidence estimation and calibration. We then use these self-annotated responses to fine-tune the model via the Direct Preference Optimization (DPO) algorithm. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama-family models across three key knowledge-intensive tasks on TruthfulQA and BioGEN.
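
The data-construction step described above (generate candidates, score them by self-evaluation, keep preferred/dispreferred pairs for DPO) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_preference_pairs`, `PreferencePair`, and the `min_margin` threshold are hypothetical names introduced here, and `self_eval` is a stand-in for whatever confidence score the model assigns to its own response.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the self-evaluation scored as more factual
    rejected: str  # response the self-evaluation scored as less factual


def build_preference_pairs(
    prompt: str,
    candidates: List[str],
    self_eval: Callable[[str, str], float],
    min_margin: float = 0.1,
) -> List[PreferencePair]:
    """Turn self-evaluated candidates into pairwise preference data.

    `self_eval(prompt, response)` is assumed to return a factuality
    confidence in [0, 1]. `min_margin` is an illustrative assumption:
    pairs whose scores are too close are skipped as uninformative.
    """
    scored = [(self_eval(prompt, c), c) for c in candidates]
    pairs: List[PreferencePair] = []
    for (score_a, resp_a), (score_b, resp_b) in combinations(scored, 2):
        if abs(score_a - score_b) < min_margin:
            continue  # no clear preference between the two responses
        if score_a > score_b:
            pairs.append(PreferencePair(prompt, resp_a, resp_b))
        else:
            pairs.append(PreferencePair(prompt, resp_b, resp_a))
    return pairs


# Toy usage: a lookup table plays the role of the model's self-evaluation.
toy_scores = {"answer A": 0.9, "answer B": 0.5, "answer C": 0.45}
toy_pairs = build_preference_pairs(
    "Who wrote the paper?",
    list(toy_scores),
    lambda _prompt, response: toy_scores[response],
)
for pair in toy_pairs:
    print(f"chosen={pair.chosen!r} rejected={pair.rejected!r}")
```

The resulting pairs have the prompt/chosen/rejected shape that preference-optimization trainers typically consume; how the paper filters or weights pairs is not specified in this section.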

Paper Structure

This paper contains 45 sections, 3 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Illustration of Self-Alignment for Factuality. Given a prompt to write a biography, before factuality alignment, the LLM generates some facts that are not accurate. Through self-evaluation, the LLM is capable of identifying these inaccurate facts. The feedback from the self-evaluation is used as a reward signal to align the LLM towards factuality. Each fact is highlighted in distinct colors, and the corrected facts are marked with green letters.
  • Figure 2: A diagram illustrating the three steps of our Self-Alignment for Factuality (in the long-form text generation task): (i) Step 1: Generate initial responses for preference data collection. (ii) Step 2: Estimate the factuality of the responses through self-evaluation for preference labeling. (iii) Step 3: Create pairwise preference data and fine-tune the LLM using DPO.
  • Figure 3: The process of constructing training data for SK-Tuning.
  • Figure 4: Results of pairwise comparisons on BioGEN across four dimensions: factuality, helpfulness, relevance and naturalness, as evaluated by GPT-4. The left and right sections present the win rates of Self-Alignment for Factuality w/ Self-Eval-SKT against FactTune-MC and Self-Alignment for Factuality w/ Self-Eval-P(True), respectively.
  • Figure 5: Calibration curves for Self-Eval-P(True) and Self-Eval-SKT on Llama2-7B on the CommonsenseQA task. Following Kadavath et al. (2022), we plot confidence vs. the frequency with which a prediction is correct. The dashed line indicates perfect calibration.
  • ...and 1 more figure
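
The calibration curve described in the Figure 5 caption (confidence vs. frequency of correct predictions) can be computed along these lines. This is a generic sketch using equal-width confidence bins; the function name `calibration_curve`, the `n_bins` parameter, and the binning scheme are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def calibration_curve(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    A perfectly calibrated model has accuracy equal to mean confidence
    in every bin, i.e. its points lie on the diagonal.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    curve: List[Tuple[float, float, int]] = []
    for members in bins:
        if not members:
            continue  # skip empty bins rather than plotting a fake point
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        curve.append((mean_conf, accuracy, len(members)))
    return curve


# Toy usage with made-up confidences and correctness labels.
points = calibration_curve(
    [0.1, 0.15, 0.9, 0.95], [False, False, True, True], n_bins=2
)
for mean_conf, accuracy, count in points:
    print(f"confidence={mean_conf:.3f} accuracy={accuracy:.2f} n={count}")
```

Plotting accuracy against mean confidence for each bin, with the diagonal as a dashed reference line, gives a curve of the kind the caption describes.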