Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning

Qihuang Zhong, Liang Ding, Fei Liao, Juhua Liu, Bo Du, Dacheng Tao

TL;DR

This work identifies knowledge conflicts as a key bottleneck in domain-specific instruction-tuning and introduces Knowledge-aware Data Selection (KDS), which quantifies conflicts via context-memory alignment (KA) and intra-memory consistency (KC) using multiple model responses and NLI-based evaluation. By applying quality and diversity filters and sampling strategically, KDS selects data that better aligns with pretrained LLM knowledge, leading to significant and consistent gains across LLaMA-3 and Qwen backbones in medical QA tasks, including reductions in hallucination. Extensive ablations show KA/KC are central to performance, with larger NLI models and carefully chosen thresholds further boosting results. The approach demonstrates improved data efficiency, multilingual generalization, and potential applicability to other domains, offering a practical DS framework for domain-specific adaptation of large language models.

Abstract

Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since an instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize data efficiency. Despite their success in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs' pretrained knowledge and the context knowledge of the instruction data, which can damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics that quantitatively measure knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering out data with large knowledge conflicts and sampling high-quality and diverse data, KDS can effectively stimulate LLMs' abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically show that KDS surpasses the other baselines and brings significant and consistent performance gains across all LLMs. More encouragingly, KDS effectively improves model generalization and alleviates the hallucination problem.

Paper Structure

This paper contains 31 sections, 2 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Performance comparisons (%) of different DS metrics. Notably, "IFD" means the instruction-following difficulty [li2024quantity], "Complexity$_{deita}$" and "Quality$_{deita}$" are from DEITA [liumakes], and the metrics in red are ours. The y-axis denotes the average performance of tuned LLaMA models on several medical benchmarks; details are given in the experiments section.
  • Figure 2: Overview of our KDS framework, which contains three processes: ❶ obtaining multiple responses of the LLM for each question; ❷ scoring the data with the knowledge alignment and consistency metrics; ❸ filtering the low-quality and repetitive data, and sampling the final data. Notably, for ease of illustration, we only show a representative sample and a simplified formulation in (b) and (c). $n$ denotes the number of responses for each question, $p_j=\frac{1}{n}$ is the probability assigned to the $j$-th response, and $p'_{i}=\sum p_j$ is the sum of the probabilities of the responses in the $i$-th cluster.
  • Figure 3: Distributions of quality score measured by different base LLMs. The x-axis denotes the measured quality score, and the y-axis denotes the number of data points.
  • Figure 4: Comparative winning rates (%) of KDS-KA+KC vs. other baselines on the long-form medical QA benchmark [hosseini2024benchmark]. LLaMA-3-8B-Instruct is used as the base model, and GPT-4o-mini is used as the automated evaluator.
  • Figure 5: (a) Effect of NLI models with different model sizes, (b) parameter analysis of the quality threshold $\tau$, and (c) parameter analysis of the diversity threshold $\lambda$. Notably, we use LLaMA-3-8B-Instruct as the base model and report the average performance on the HoT and multiple-choice QA benchmarks.
  • ...and 2 more figures
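
The cluster probabilities described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes only what the caption states: each of the $n$ sampled responses gets probability $p_j = \frac{1}{n}$, and a cluster's probability $p'_i$ is the sum over its members. The function `cluster_probabilities`, the greedy grouping strategy, and the string-matching `normalized_match` are hypothetical stand-ins; the paper uses NLI-based evaluation to compare responses, and the exact formulas that turn these probabilities into the KA and KC scores are not given in this excerpt.

```python
from typing import Callable, List


def cluster_probabilities(
    responses: List[str],
    same_meaning: Callable[[str, str], bool],
) -> List[float]:
    """Group responses into clusters and return p'_i for each cluster.

    Each response gets probability p_j = 1/n; a cluster's probability is
    the sum of p_j over its members. `same_meaning` is a hypothetical
    equivalence check (an NLI model would play this role in practice).
    """
    n = len(responses)
    if n == 0:
        return []

    clusters: List[List[str]] = []
    for response in responses:
        # Greedy assignment: compare against the first member of each cluster.
        for cluster in clusters:
            if same_meaning(cluster[0], response):
                cluster.append(response)
                break
        else:
            clusters.append([response])

    p_j = 1.0 / n
    return [len(cluster) * p_j for cluster in clusters]


def normalized_match(a: str, b: str) -> bool:
    """Toy equivalence check (assumption): case-insensitive exact match."""
    return a.strip().lower() == b.strip().lower()


# Four sampled answers to one question; three agree, one differs.
probs = cluster_probabilities(
    ["Aspirin", "aspirin ", "Ibuprofen", "aspirin"],
    normalized_match,
)
print(probs)
```

A distribution concentrated on one cluster suggests the model answers the question consistently from memory, while a flat distribution suggests conflicting internal knowledge.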