Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng
TL;DR
This work tackles hallucination and context unfaithfulness in long-form QA by introducing GenDiE, a self-evolving framework that operates at a fine-grained sentence level. It unifies generative and discriminative learning through a multi-task objective and iteratively refines training data via self-generation and self-scoring in a tree-structured data construction process. A hierarchical inference mechanism combines token-level generation with sentence-level search, enabling score-guided selection of faithful sentences. Experiments on ASQA and ConFiQA show that GenDiE surpasses baselines in faithfulness and correctness, remains robust under domain shift, and benefits clearly from sentence-level optimization and iterative self-improvement.
Abstract
Improving context faithfulness in large language models (LLMs) is essential for developing trustworthy retrieval-augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene on LLMs only at inference time without addressing their inherent limitations, or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities that enable iterative self-evolution. These capabilities support both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE addresses a limitation of previous approaches that optimize at the holistic answer level and may therefore miss unfaithful details. Experiments on the ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance under domain adaptation.
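
To make the idea of score-guided search at the sentence level more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a greedy search that keeps the single best-scoring candidate sentence at each step. The names `sentence_level_search`, `toy_propose`, and `toy_score` are hypothetical: in GenDiE the same LLM would presumably play both the proposing and the scoring role, whereas here a fixed candidate pool and a word-overlap score stand in for them.

```python
from typing import Callable, List


def sentence_level_search(
    question: str,
    context: str,
    propose: Callable[[str, str, List[str]], List[str]],
    score: Callable[[str, str, List[str], str], float],
    max_sentences: int = 5,
) -> List[str]:
    """Greedy sentence-level search (an assumed strategy, for illustration).

    At each step, `propose` returns candidate next sentences and `score`
    rates each one against the context; the top-scoring one is kept.
    """
    answer: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(question, context, answer)
        if not candidates:  # nothing left to propose: stop early
            break
        best = max(candidates, key=lambda s: score(question, context, answer, s))
        answer.append(best)
    return answer


# --- Toy stand-ins (hypothetical; a real system would call an LLM) ---

def toy_propose(question: str, context: str, prefix: List[str]) -> List[str]:
    # Fixed candidate pool per step, mimicking sampled sentence continuations.
    pool = [
        ["Paris is the capital of France.", "Lyon is the capital of France."],
        ["It lies on the Seine.", "It lies on the Thames."],
    ]
    return pool[len(prefix)] if len(prefix) < len(pool) else []


def toy_score(question: str, context: str, prefix: List[str], sentence: str) -> float:
    # Crude faithfulness proxy: fraction of sentence words found in the context.
    ctx_words = set(context.lower().replace(".", "").split())
    words = sentence.lower().replace(".", "").split()
    return sum(w in ctx_words for w in words) / len(words)


context = "Paris is the capital of France. The city lies on the Seine."
result = sentence_level_search(
    "What is the capital of France?", context, toy_propose, toy_score
)
print(result)
```

With these toy stand-ins, the candidate that conflicts with the context loses at each step, which is the behavior that scoring individual sentences rather than whole answers is meant to encourage.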
