Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang

TL;DR

The paper tackles hallucination and factuality in long-form LLM generation by introducing Knowledge-Level Consistency Reinforcement Learning (KLCF), which aligns the model's expressed knowledge with its pre-trained parametric knowledge through a dual-fact alignment mechanism. It integrates offline data preparation to build a factual checklist and a self-assessed truthfulness reward, enabling online reinforcement learning without external retrieval and using Group Relative Policy Optimization. Empirical results across multiple long-form benchmarks show improved factuality, with gains in recall and precision and robust scalability from 7B to 32B models, while maintaining efficiency. Limitations include a closed-book setting and lack of intermediate-step supervision, with future work proposing step-wise factual alignment and potential real-time search integration to further enhance coverage and accuracy.

Abstract

Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model's internal knowledge boundaries, exacerbating the so-called "hallucination tax". To address this challenge, we propose the Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), which focuses on the consistency between the policy model's expressed knowledge and the base model's parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct a fact checklist that guides online reinforcement learning toward better factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model's internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.
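
The dual-fact idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the reward formulas, so the function names, the equal weighting of recall and precision, and the 0.5 truthfulness threshold are all hypothetical choices made for illustration. The only borrowed element is the group-relative normalization commonly used in Group Relative Policy Optimization, where each sampled response's reward is standardized against the other responses to the same prompt.

```python
from statistics import mean, pstdev


def checklist_recall(covered: list[bool]) -> float:
    """Fraction of checklist facts that a response covers (recall side)."""
    return sum(covered) / len(covered) if covered else 0.0


def truthfulness_precision(claim_scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of a response's claims that a self-assessment scorer rates
    at or above `threshold` (precision side). The threshold is an assumption."""
    if not claim_scores:
        return 0.0
    return sum(score >= threshold for score in claim_scores) / len(claim_scores)


def dual_fact_reward(covered: list[bool], claim_scores: list[float],
                     w_recall: float = 0.5, w_precision: float = 0.5) -> float:
    """Hypothetical combined reward: weighted sum of recall and precision."""
    return (w_recall * checklist_recall(covered)
            + w_precision * truthfulness_precision(claim_scores))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one prompt, scored against a 4-fact checklist.
group = [
    ([True, True, False, False], [0.9, 0.8, 0.2, 0.7]),  # broad but one doubtful claim
    ([True, False, False, False], [0.9, 0.95]),          # precise but low coverage
    ([True, True, True, False], [0.9, 0.8, 0.7, 0.9]),   # broad and precise
]
rewards = [dual_fact_reward(covered, scores) for covered, scores in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the third response earns the largest advantage because it scores well on both terms, which is the behavior a joint recall and precision objective is meant to encourage. A response that only shrinks its claims (the second one) gains precision but loses recall, so it is not preferred over one that covers more known facts.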

Paper Structure

This paper contains 36 sections, 16 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Conceptual illustration of the motivation for KLCF alignment. Conventional long-form factuality alignment methods reduce hallucinations by shrinking the Expressed Knowledge at the cost of coverage, while KLCF aims to expand the intersection of the Expressed Knowledge and the Parametric Knowledge acquired from pre-training.
  • Figure 2: KLCF framework (Left) vs. Previous work (Right). Unlike previous methods that rely on costly online external knowledge retrieval for real-time verification, our framework achieves dual-fact alignment through knowledge-level consistency rewards, which are computed efficiently using offline-prepared resources—a factual checklist and a truthfulness reward model—enabling scalable RL training without external dependencies.
  • Figure 3: Offline data preparation pipeline. The process constructs the essential resources for knowledge-level consistency rewards—a factual checklist and truthfulness reward model training data—by extracting and verifying claims from the base model's responses.
  • Figure 4: Training Dynamics of KLCF-zero on Qwen2.5-14B. The figure illustrates the progression of key metrics throughout the reinforcement learning process. (a) The core knowledge-level consistency rewards, all showing significant improvement. (b) The auxiliary rewards guiding response quality and structure. (c) The KL divergence, measuring the deviation from the base model. (d) The entropy loss, reflecting the policy's exploration-exploitation balance. (e) The lengths of the generated responses and the internal reasoning chains.
  • Figure 5: Training Dynamics of KLCF-zero on Qwen2.5-7B.
  • ...and 1 more figure