Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution

Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng

TL;DR

This work tackles hallucination and context faithfulness in long-form QA by introducing GenDiE, a self-evolving framework that operates at a fine-grained sentence level. It unifies generative and discriminative learning through a multi-task objective and iteratively refines training data via self-generation and self-scoring in a tree-structured data construction process. A hierarchical inference mechanism combines token-level generation with sentence-level search, enabling score-guided selection of faithful sentences. Experiments on ASQA and ConFiQA show that GenDiE surpasses baselines in faithfulness and correctness, is robust to domain shift, and benefits from both sentence-level optimization and iterative self-improvement.

Abstract

Improving context faithfulness in large language models is essential for developing trustworthy retrieval-augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene on LLMs only at inference time without addressing their inherent limitations or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. These capabilities support both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE addresses a limitation of previous approaches that optimize at the holistic answer level and may therefore miss unfaithful details. Experiments on the ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance under domain adaptation.
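
The score-guided search mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sentence_level_search`, `propose`, `score`, and `overlap_score` are hypothetical, and the word-overlap scorer is only a stand-in for the model's own self-scoring. In each step, several candidate next sentences are proposed, each is scored for faithfulness to the context, and the highest-scoring one is appended to the answer.

```python
from typing import Callable, List

def sentence_level_search(
    propose: Callable[[str, List[str]], List[str]],
    score: Callable[[str, List[str], str], float],
    context: str,
    max_sentences: int = 5,
) -> List[str]:
    """Greedy sentence-level search: keep the best-scored candidate per step."""
    answer: List[str] = []
    for _ in range(max_sentences):
        # Hypothetical generator: returns candidate next sentences.
        candidates = propose(context, answer)
        if not candidates:
            break
        # Hypothetical scorer: higher means more faithful to the context.
        best = max(candidates, key=lambda s: score(context, answer, s))
        answer.append(best)
    return answer

def overlap_score(context: str, prefix: List[str], sentence: str) -> float:
    """Toy faithfulness proxy (assumption): share of words found in the context."""
    ctx = {w.strip(".,").lower() for w in context.split()}
    words = [w.strip(".,").lower() for w in sentence.split()]
    return sum(w in ctx for w in words) / max(len(words), 1)

# Toy candidate pools standing in for sampled generations at each step.
context = "The Eiffel Tower is in Paris. It was completed in 1889."
pools = [
    ["The Eiffel Tower is in Rome.", "The Eiffel Tower is in Paris."],
    ["It was completed in 1925.", "It was completed in 1889."],
]

def toy_propose(ctx: str, prefix: List[str]) -> List[str]:
    return pools[len(prefix)] if len(prefix) < len(pools) else []

result = sentence_level_search(toy_propose, overlap_score, context)
print(result)
```

In a real system, both `propose` and `score` would be backed by the same trained model, with token-level decoding inside `propose` and sentence-level selection in the outer loop.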

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of GenDiE: (a) The pre-stage uses gold answer sentences from a seed dataset as target faithful instances, while filtered self-generated sentences (produced without access to supporting passages) serve as negative samples. (b) The self-evolving stages leverage the model from the previous iteration for both self-generation and self-scoring, constructing training datasets via tree-structured sampling. Throughout all stages of the self-evolving framework, both a language modeling loss (optimizing towards the target instances $a$) and a discrimination loss (assigning higher faithfulness scores to $a$ than to $a'$) are incorporated.
  • Figure 2: Performance comparisons between GenDiE and GenDiE$_{\text{gold-answer}}$ across iterations.
  • Figure 3: Performance comparisons between GenDiE and GenDiE$_{\text{answer-level}}$ across iterations.
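
The two training signals named in the Figure 1 caption can be sketched numerically. This is an illustration under stated assumptions, not the paper's formulation: the pairwise logistic form of the discrimination loss, the `weight` parameter, and all function names are hypothetical choices made here to show how a language modeling term on $a$ and a ranking term on $(a, a')$ might be combined.

```python
import math
from typing import List

def lm_loss(token_log_probs: List[float]) -> float:
    """Mean negative log-likelihood of the target sentence a."""
    return -sum(token_log_probs) / len(token_log_probs)

def discrimination_loss(score_pos: float, score_neg: float) -> float:
    """Assumed pairwise logistic loss: -log sigmoid(score_pos - score_neg).

    It is small when the faithful sentence a outscores the negative a'.
    """
    diff = score_pos - score_neg
    # Numerically stable evaluation of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def multi_task_loss(
    token_log_probs: List[float],
    score_pos: float,
    score_neg: float,
    weight: float = 1.0,  # hypothetical balancing coefficient
) -> float:
    """Sum of the generative term and the weighted discriminative term."""
    return lm_loss(token_log_probs) + weight * discrimination_loss(score_pos, score_neg)

# Example: a is scored above a', so the discriminative term is small.
total = multi_task_loss([-0.2, -0.4, -0.3], score_pos=2.0, score_neg=-1.0)
print(round(total, 4))
```

The ranking term only depends on the score difference, so it pushes the model to separate $a$ from $a'$ rather than to hit any absolute score.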