KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance

Qihuang Zhong, Liang Ding, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao

TL;DR

Domain-specific QA with SFT often suffers from knowledge conflict between an LLM's internal knowledge and training data. The authors propose KaFT, a knowledge-aware fine-tuning framework, built on a robust query-diversification conflict detector and sample-adaptive rewards that weight training data by conflict level; this suppresses harmful signals while leveraging useful conflict information. Empirical results across multiple LLMs (LLaMA3, Qwen, Mistral) and diverse medical, multilingual, and out-of-domain benchmarks show consistent gains and reduced hallucination, with notable improvements in OOD robustness. The findings indicate KaFT’s potential to generalize beyond medical QA to broader domain-specific tasks, improving both performance and reliability of LLMs in specialized settings.

Abstract

Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that, due to conflicts between LLMs' internal knowledge and the context knowledge of training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data is more beneficial. Motivated by this, we propose a simple yet effective Knowledge-aware Fine-tuning (KaFT) approach to boost LLMs' performance. The core of KaFT is to adapt the training weight by assigning different rewards to different training samples according to their conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. Further analyses show that KaFT effectively improves model generalization and alleviates hallucination.
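The idea of adapting the training weight through per-sample rewards can be illustrated with a small sketch. This is not the authors' implementation: the function name `weighted_sft_loss`, the choice to average over the batch size, and the example reward values are all assumptions made here for illustration. The sketch only shows how a reward below 1 shrinks the contribution of a high-conflict sample, and how setting every reward to 1 recovers the plain SFT average.

```python
from typing import Sequence


def weighted_sft_loss(sample_losses: Sequence[float],
                      rewards: Sequence[float]) -> float:
    """Average of per-sample losses, each scaled by its reward.

    Hypothetical helper: `sample_losses` stands for the per-sample
    SFT losses of one batch, `rewards` for the conflict-dependent
    weights. With all rewards equal to 1 this is the ordinary mean.
    """
    if len(sample_losses) != len(rewards):
        raise ValueError("sample_losses and rewards must have equal length")
    if not sample_losses:
        raise ValueError("batch must not be empty")
    scaled = [loss * reward for loss, reward in zip(sample_losses, rewards)]
    # Dividing by the batch size (not by the reward sum) is an assumption:
    # it lets a small reward reduce the overall gradient magnitude.
    return sum(scaled) / len(scaled)


if __name__ == "__main__":
    losses = [2.0, 4.0]
    # Vanilla SFT: every sample counts equally.
    print(weighted_sft_loss(losses, [1.0, 1.0]))
    # Down-weight the second, hypothetically high-conflict, sample.
    print(weighted_sft_loss(losses, [1.0, 0.5]))
```

In a real training loop the same scaling would be applied to per-sample loss tensors before the backward pass; the reward values themselves would come from the conflict detector rather than being hard-coded.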

Paper Structure

This paper contains 48 sections, 2 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Comparison between (a) vanilla SFT and (b) our KaFT. Unlike vanilla SFT, which treats all training data equally, KaFT uses sample-adaptive rewards to facilitate more effective learning of LLMs.
  • Figure 2: (a) Distributions of $Score_i$ on MedQA across different LLMs. We use kernel density estimation for visualization, where a larger density indicates more training samples. (b) Performance comparison (%) of different subsets. Note that all subsets hold the same number of training samples. (c) Analysis of different proportions of wrong data. Specifically, we randomly select varying numbers of samples from the wrong subset and merge them with the other three subsets. We use three different random seeds for data sampling and report the average results.
  • Figure 3: Effect of reward strategies in KaFT. The y-axis denotes the average performance of medical QA.
  • Figure 4: Parameter analyses of KaFT. The y-axis and x-axis denote the varied $\alpha$ and $\beta$, respectively. We report the average results on medical QA benchmarks.
  • Figure 5: Performance comparison (%) on multilingual medical QA. LLaMA3-8B is used as the base model.
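
The captions above refer to a per-sample $Score_i$ and to four subsets, one of which is named wrong. The excerpt does not define the score, so the following is a rough sketch under stated assumptions: the score is taken to be the fraction of diversified queries (here, the same question with shuffled answer options) that the model answers correctly, and the thresholds and the three subset names other than "wrong" are hypothetical. `answer_fn` is a placeholder for a call to the LLM.

```python
import random
from typing import Callable, List, Sequence

# Placeholder type for a model call: (question, options) -> chosen option text.
AnswerFn = Callable[[str, Sequence[str]], str]


def diversify(options: Sequence[str], n_variants: int, seed: int = 0) -> List[List[str]]:
    """Create query variants by shuffling the order of the answer options."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        shuffled = list(options)
        rng.shuffle(shuffled)
        variants.append(shuffled)
    return variants


def conflict_score(question: str, options: Sequence[str], gold: str,
                   answer_fn: AnswerFn, n_variants: int = 4, seed: int = 0) -> float:
    """Fraction of variants answered correctly (assumed definition).

    A low score means the model's internal knowledge disagrees with the
    training label, i.e. a larger conflict.
    """
    variants = diversify(options, n_variants, seed)
    hits = sum(answer_fn(question, variant) == gold for variant in variants)
    return hits / n_variants


def bucket_by_score(score: float) -> str:
    """Map a score to one of four subsets (names and cut-offs are hypothetical)."""
    if score <= 0.0:
        return "wrong"
    if score < 0.5:
        return "mostly_wrong"
    if score < 1.0:
        return "mostly_right"
    return "right"


if __name__ == "__main__":
    # Toy stand-in for an LLM that always picks the gold answer.
    oracle = lambda question, options: "aspirin"
    s = conflict_score("Which drug?", ["aspirin", "codeine", "insulin"], "aspirin", oracle)
    print(s, bucket_by_score(s))
```

Scoring against several variants rather than a single query is meant to reduce sensitivity to option order; a sample-adaptive reward could then be looked up from the resulting subset label.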