GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation

Runchuan Zhu, Zinco Jiang, Jiang Wu, Zhipeng Ma, Jiahe Song, Fengshuo Bai, Dahua Lin, Lijun Wu, Conghui He

TL;DR

This paper introduces GRAIT, a gradient-driven Refusal-Aware Instruction Tuning framework, to mitigate hallucinations in LLMs while avoiding over-refusal. By deriving two gradient-based observations ($O_1$ and $O_2$) and separating data into ik and idk sets, GRAIT performs three stages: constructing ik/idk datasets, selecting influential idk samples, and applying stable influence-based weighting during fine-tuning. Empirical results on MCQA and OEQA tasks show that GRAIT outperforms prior RAIT methods in both reducing hallucinations (lower $P_w$) and improving overall effectiveness (higher THS) across ID and OOD evaluations. The work advances safe, reliable LLM behavior with relatively data-efficient gradient-guided training, and suggests directions for modeling knowledge boundaries and dynamic gradient influences in future research.

Abstract

Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: first, rejecting unknown questions effectively to minimize hallucinations; second, avoiding over-refusal so that questions the model can answer correctly are not rejected, thereby maintaining the helpfulness of LLM outputs. In this paper, we address both challenges by deriving observations from a gradient-based perspective and proposing the Gradient-driven Refusal-Aware Instruction Tuning framework, GRAIT, which (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, balancing accurate refusals against useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT .
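The abstract names gradient-driven sample selection without giving its formulas here. As a rough illustration of the general idea only, the sketch below scores candidate training samples by a first-order influence estimate: the dot product between a candidate's loss gradient and the mean loss gradient of a reference set. Everything in it is an assumption made for illustration: the toy logistic model, the function names `logistic_grad`, `first_order_influence`, and `select_top_k`, and the toy data. It is not GRAIT's actual selection rule, which depends on the paper's observations $O_1$ and $O_2$.

```python
import numpy as np


def logistic_grad(w, x, y):
    """Gradient of binary cross-entropy for a linear model at one (x, y) sample."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x


def first_order_influence(w, candidate, reference_set):
    """Dot product of the candidate gradient with the mean reference gradient.

    To first order, a small SGD step on the candidate changes the mean
    reference loss by -lr * (this value), so a positive score means the
    candidate is expected to help the reference set.
    """
    g_cand = logistic_grad(w, *candidate)
    g_ref = np.mean([logistic_grad(w, x, y) for x, y in reference_set], axis=0)
    return float(g_cand @ g_ref)


def select_top_k(w, candidates, reference_set, k):
    """Return indices of the k highest-scoring candidates, plus all scores."""
    scores = np.array(
        [first_order_influence(w, c, reference_set) for c in candidates]
    )
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores


# Hypothetical toy data: 2-D features, binary labels, zero-initialised weights.
w = np.zeros(2)
reference = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 0.0)]
candidates = [
    (np.array([1.0, 0.0]), 1.0),  # agrees with the reference set
    (np.array([1.0, 0.0]), 0.0),  # conflicts with the first reference sample
    (np.array([0.0, 2.0]), 0.0),  # agrees, with a larger gradient
]
top, scores = select_top_k(w, candidates, reference, k=2)
print(top, scores)
```

In an LLM setting the same score would be computed from per-sample gradients of the language-modeling loss, typically over a reduced set of parameters to keep it tractable; that choice is likewise an assumption rather than something stated in this summary.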

Paper Structure

This paper contains 40 sections, 14 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Descriptions of C1 & C2. After RAIT, the initial LLM largely rejects unknown questions to avoid errors. However, the overly conservative nature of RAIT also leads to a decrease in accuracy.
  • Figure 2: Case of mitigating hallucination and avoiding over-refusal.
  • Figure 3: Overview of our framework. GRAIT contains three stages: (1) Constructing datasets $\mathbf{D_{\text{ik}}}$ and $\mathbf{D_{\text{idk}}}$ by querying the internal state of LLMs. (2) Distilling the datasets to select idk samples based on the first observation $\mathbf{O}_1$. (3) Performing Influence-directed Refusal-aware Instruction Tuning using the second observation $\mathbf{O}_2$.
  • Figure 4: Illustration of Truthful Helpfulness Score.
  • Figure 5: Relationship between $\mathcal{I}^{\text{ref}}$ and $\mathcal{I}^{\text{over}}$ in MMLU performance on LLaMA2-7B-Chat and LLaMA3-8B-Instruct.
  • ...and 1 more figure