
MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

TL;DR

Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits, highlighting that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.

Abstract

Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remain less effective but contribute positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
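
The abstract does not list the seven fine-tuning configurations. One reading consistent with that count is every non-empty combination of the three sources (2^3 - 1 = 7). The sketch below enumerates them under that assumption; the function name and source labels are illustrative, not taken from the paper.

```python
from itertools import combinations

# Labels for the three data sources named in the abstract.
SOURCES = ("native", "synthetic", "translated")


def source_configurations(sources=SOURCES):
    """Return every non-empty combination of sources, smallest first.

    Assumption: the seven configurations are the non-empty subsets of
    the three sources. The abstract gives the count but not the list.
    """
    return [
        combo
        for size in range(1, len(sources) + 1)
        for combo in combinations(sources, size)
    ]


if __name__ == "__main__":
    for combo in source_configurations():
        print(" + ".join(combo))
```

Run as a script, this prints three single-source setups, three pairwise mixes, and the full three-way mix.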

Paper Structure (38 sections, 2 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Sample instances from the three MedInjection-FR components. Each includes the task type (MCQ, MCQU, or OEQ) and, when applicable, its supporting context. The figure illustrates the diversity of medical reasoning tasks and prompt styles across data sources.
  • Figure 2: Distribution of medical specialties in MedInjection-FR (native + synthetic components). Proportions are normalized across fourteen major medical categories derived from specialty metadata.
  • Figure 3: Quality evaluation of synthetic data by four LLM judges. The left panel reports OEQ scores on a 1-5 scale, and the right panel reports MCQ scores on a 1-3 scale.
  • Figure 4: t-SNE visualization of instruction embeddings across the three data sources (native, translated, synthetic).
  • Figure 5: Relationship between mean output length and LLM-judge accuracy across fine-tuned models. Each point represents one configuration, with the dashed line indicating the best-fit linear trend.
  • ...and 9 more figures
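
The best-fit linear trend described for Figure 5 can be illustrated with an ordinary least-squares fit of judge accuracy against mean output length. This is a minimal sketch, not the authors' analysis code: the function name is hypothetical and the numbers are made-up placeholders, not values from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # PLACEHOLDER data, one point per fine-tuned configuration.
    # These values are invented for illustration only.
    mean_output_length = [100, 200, 300, 400]   # e.g. tokens per answer
    judge_accuracy = [0.50, 0.58, 0.61, 0.71]   # fraction judged correct

    slope, intercept = fit_line(mean_output_length, judge_accuracy)
    print(f"slope={slope:.5f} per token, intercept={intercept:.3f}")
```

A positive slope in such a fit would be in line with the verbosity sensitivity the abstract reports for LLM judges, though a fit over a handful of configurations is only a rough indicator.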