CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

Zhehao Tan, Yihan Jiao, Dan Yang, Junjie Wang, Duolin Sun, Jie Feng, Xidong Wang, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu

TL;DR

This work proposes a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR), which directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence.

Abstract

With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in self-judgment can cause hallucinations to accumulate and the model to eventually collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
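
To make the "log-likelihood gap" concrete, the following is a minimal sketch and not the authors' implementation (their code is not yet released). It assumes the per-token log-probabilities of one fixed response have already been obtained twice from the same model: once with the retrieved documents in the prompt and once without. The function name, the `length_normalize` option, and the per-token gap list are illustrative assumptions; the paper's exact normalization and its token-level quantity from Figure 2 may be defined differently.

```python
from typing import List, Sequence, Tuple


def contrastive_likelihood_reward(
    logp_with_docs: Sequence[float],
    logp_without_docs: Sequence[float],
    length_normalize: bool = False,  # hypothetical option, not taken from the paper
) -> Tuple[float, List[float]]:
    """Return (sequence-level gap, per-token gaps) for one response.

    Both inputs hold log p(y_t | ...) for the same response tokens y_1..y_T,
    scored with and without the supporting documents in the prompt.
    """
    if len(logp_with_docs) != len(logp_without_docs):
        raise ValueError("both scorings must cover the same response tokens")
    if len(logp_with_docs) == 0:
        raise ValueError("response must contain at least one token")

    # Per-token gap: how much the documents raise (or lower) each token's log-prob.
    token_gaps = [w - wo for w, wo in zip(logp_with_docs, logp_without_docs)]

    # Sequence-level gap: log p(y | query, docs) - log p(y | query).
    reward = sum(token_gaps)
    if length_normalize:
        reward /= len(token_gaps)
    return reward, token_gaps


# Toy numbers (made up for illustration): the documents help the first two tokens.
reward, gaps = contrastive_likelihood_reward(
    [-0.25, -0.5, -1.0],
    [-1.25, -1.0, -1.0],
)
print(reward, gaps)
```

Under these assumptions, a positive value means the response becomes more likely once the documents are present, and a negative value means the documents make it less likely, which matches the sign convention described for Figure 1.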

Paper Structure (19 sections, 14 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Comparison between traditional RAG RL methods (external judge signals) and our Contrastive Likelihood Reward (CLR). All rollouts are generated from the same input, but outcomes vary with how much the model uses the retrieved documents. A higher positive score indicates greater document utilization by the model, whereas a larger negative score indicates that the documents pose a greater burden on the model.
  • Figure 2: An example of token-level Evidential Contribution. The darker the color, the larger the absolute value of $IG_{\text{token}}(y_t)$.
  • Figure 3: Faithfulness score over training steps.
  • Figure 4: Perplexity Length vs. Training Steps
  • Figure 5: Response Length vs. Training Steps
  • ...and 1 more figures