MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

Ikram Belmadani; Oumaima El Khettari; Pacôme Constant dit Beaufils; Benoit Favre; Richard Dufour

TL;DR

Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits, highlighting that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.

Abstract

Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remain less effective but contribute positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
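
The abstract mentions seven configurations built from three data sources. One reading, which is an assumption here and not stated in the abstract, is that these are the non-empty combinations of the three sources (2^3 - 1 = 7). A minimal Python sketch under that assumption follows; the function name `source_configurations` and the string labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Labels for the three data sources named in the abstract.
SOURCES = ("native", "synthetic", "translated")


def source_configurations(sources=SOURCES):
    """Return every non-empty combination of the given data sources.

    Assumption: the paper's seven training configurations correspond to
    these combinations; the abstract does not spell this out.
    """
    configs = []
    for size in range(1, len(sources) + 1):
        configs.extend(combinations(sources, size))
    return configs


if __name__ == "__main__":
    # Print one configuration per line, e.g. "native + translated".
    for config in source_configurations():
        print(" + ".join(config))
```

With three sources, the enumeration yields three single-source, three two-source, and one three-source configuration.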

Paper Structure (38 sections, 2 equations, 14 figures, 7 tables)

Introduction
Related Work
Dataset Construction
Overview
Native Data
Translated Data
Synthetic Data
Experimental Setup
Training Configurations
Base Model and Training Procedure
Evaluation Protocol
Automatic Evaluation
Human and LLM-as-a-Judge Evaluation
Results
Multiple-Choice Question Answering
...and 23 more sections

Figures (14)

Figure 1: Sample instances from the three MedInjection-FR components. Each includes the task type (MCQ, MCQU, or OEQ) and, when applicable, its supporting context. The figure illustrates the diversity of medical reasoning tasks and prompt styles across data sources.
Figure 2: Distribution of medical specialties in MedInjection-FR (native + synthetic components). Proportions are normalized across fourteen major medical categories derived from specialty metadata.
Figure 3: Quality evaluation of synthetic data by four LLM judges. The left panel reports OEQ scores on a 1-5 scale, and the right panel reports MCQ scores on a 1-3 scale.
Figure 4: t-SNE visualization of instruction embeddings across the three data sources (native, translated, synthetic).
Figure 5: Relationship between mean output length and LLM-judge accuracy across fine-tuned models. Each point represents one configuration, with the dashed line indicating the best-fit linear trend.
...and 9 more figures
