MedQARo: A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian

Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu

TL;DR

MedQARo addresses the lack of Romanian medical QA benchmarks by introducing a large-scale, clinically grounded dataset of 105,880 QA pairs drawn from 1,242 oncology epicrises. The work evaluates diverse LLM configurations, showing that domain- and language-specific fine-tuning substantially boosts performance, while zero-shot approaches struggle, especially under cross-domain shifts. Key contributions include the dataset with patient-level splits to prevent leakage, a comprehensive multi-model benchmark (including Romanian-adapted, long-context, and biomedical LLMs), and a detailed analysis of prompt formats and context length. The findings highlight the challenges of clinical QA in low-resource languages and underscore the potential of fine-tuned models, motivating retrieval-augmented strategies for improved generalization in Romanian medical NLP.
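The patient-level split mentioned above can be illustrated with a minimal sketch. It assumes each QA pair is a dictionary carrying a hypothetical `patient_id` field. The field name, the 20% test fraction, and the helper `split_by_patient` are illustrative assumptions, not the authors' released code or data schema.

```python
import random
from collections import defaultdict


def split_by_patient(qa_pairs, test_fraction=0.2, seed=0):
    """Split QA pairs so that all pairs of one patient land in the same subset.

    `patient_id` is a hypothetical field name used for illustration only.
    """
    # Group QA pairs by the patient they were written for.
    by_patient = defaultdict(list)
    for pair in qa_pairs:
        by_patient[pair["patient_id"]].append(pair)

    # Shuffle patients (not QA pairs) with a fixed seed for reproducibility.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Assign whole patients to the test subset.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [p for pid in patient_ids if pid not in test_ids for p in by_patient[pid]]
    test = [p for pid in patient_ids if pid in test_ids for p in by_patient[pid]]
    return train, test
```

Because patients rather than individual QA pairs are shuffled, no epicrisis can contribute questions to both subsets, which is the leakage that a random pair-level split would allow.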

Abstract

Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art (SOTA) large language models (LLMs). We construct a high-quality and large-scale dataset comprising 105,880 QA pairs related to cancer patients from two medical centers. The questions regard medical case summaries of 1,242 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 3,000 work hours to generate the QA pairs. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.

Paper Structure

This paper contains 18 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: An image illustrating our dataset creation and model benchmarking stages.
  • Figure 2: Distribution of question types in the MedQARo dataset. The dataset comprises three main categories: binary (yes/no) questions, extractive questions (answers explicitly found in the epicrisis), and reasoning questions (requiring inference beyond explicit mentions). Percentages are computed over the full set of 105,880 QA pairs. Best viewed in color.
  • Figure 3: Age and gender distributions of patients included in MedQARo. Rows correspond to breast cancer, lung cancer, and other cancers, respectively. The first column shows age distributions using 5-year bins (between 20 and 94 years), while the second column shows gender distributions. Best viewed in color.
  • Figure 4: Distribution across cancer types for patients included in the cross-domain test set of MedQARo. Best viewed in color.
  • Figure 5: Performance per question category (binary, extractive, and reasoning) across four evaluation metrics (F1, EM, BLEU, and METEOR). The values are reported for the top scoring model, namely Phi-4-mini-instruct based on 3,072 tokens, on the in-domain test set. Best viewed in color.
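Two of the metrics named in Figure 5, exact match (EM) and token-level F1, can be sketched as follows. This is a minimal sketch of the common SQuAD-style definitions, assuming lowercasing and whitespace tokenization. The normalization actually used in the paper is not stated here, so `normalize`, `exact_match`, and `token_f1` are illustrative helpers rather than the authors' evaluation code.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the prediction "carcinom ductal invaziv" against the reference "carcinom ductal", EM is 0 while token F1 is 0.8 (precision 2/3, recall 1), which shows why the two metrics can diverge on extractive answers.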