Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback

Hongshen Xu, Zichen Zhu, Situo Zhang, Da Ma, Shuai Fan, Lu Chen, Kai Yu

TL;DR

This paper addresses the problem of LLM hallucinations by reframing alignment around reliability rather than mere helpfulness. It introduces Reinforcement Learning from Knowledge Feedback (RLKF), a framework that automatically generates knowledge-boundary–aware preference data, trains a reliable reward model, and uses PPO to align the model with both accuracy and prudent refusal. Through in-domain and out-of-domain experiments on arithmetic and knowledge-based tasks (e.g., GSM8K and TriviaQA), RLKF significantly improves precision, truthfulness, and overall reliability, while reducing harmful hallucinations. The approach demonstrates that teaching models to abstain from answering unknown questions can yield practical, robust, and scalable improvements in reliability across tasks, with broader implications for trustworthy AI systems.

Abstract

Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, due to their limitations in discerning questions beyond their knowledge scope. While addressing hallucination has been a focal point in research, previous efforts primarily concentrate on enhancing correctness without giving due consideration to the significance of rejection mechanisms. In this paper, we conduct a comprehensive examination of the role of rejection, introducing the notion of model reliability along with corresponding metrics. These metrics measure the model's ability to provide accurate responses while adeptly rejecting questions exceeding its knowledge boundaries, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model to encourage the refusal of out-of-knowledge questions. Experimental results on mathematical questions affirm the substantial efficacy of RLKF in significantly enhancing LLM reliability.
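
To make the idea of reliability metrics concrete, here is a minimal sketch of how answering and refusing could be scored together. It is illustrative only and is not the paper's exact metric definitions: the three response labels (`correct`, `wrong`, `refused`), the function name `reliability_summary`, and the particular ratios below are assumptions chosen for this example.

```python
from collections import Counter

# Hypothetical label set for this sketch; the paper's own taxonomy may differ.
VALID_LABELS = {"correct", "wrong", "refused"}


def reliability_summary(labels):
    """Summarize a list of per-question response labels.

    accuracy           = correct / all questions
    precision          = correct / questions the model chose to answer
    rejection_rate     = refused / all questions
    hallucination_rate = wrong / all questions
    """
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")

    answered = counts["correct"] + counts["wrong"]
    return {
        "accuracy": counts["correct"] / total,
        # If the model refused everything, precision is undefined; report 0.0.
        "precision": counts["correct"] / answered if answered else 0.0,
        "rejection_rate": counts["refused"] / total,
        "hallucination_rate": counts["wrong"] / total,
    }


# Two hypothetical models that get the same five questions right.
always_answers = ["correct"] * 5 + ["wrong"] * 5
refuses_some = ["correct"] * 5 + ["wrong"] * 1 + ["refused"] * 4

print(reliability_summary(always_answers))
print(reliability_summary(refuses_some))
```

In this toy comparison both models have the same accuracy, but the one that refuses four of the questions it would otherwise get wrong has higher precision and a lower hallucination rate. That is the kind of trade-off a correctness-only metric would not show.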

Paper Structure (32 sections, 9 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: The user cases and alignment objectives for model reliability.
  • Figure 2: Reliable preference data generation pipeline. Letters with green, red, and yellow circles denote correct, incorrect, and uncertain answers, respectively. "IDK" represents "I don't know," indicating rejections.
  • Figure 3: The results on arithmetic sub-tasks.
  • Figure 4: Rejection rate comparison.
  • Figure 5: Percentage of different response types among different arithmetic questions.
  • ...and 2 more figures