MedQARo: A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian

Ana-Cristina Rogoz; Radu Tudor Ionescu; Alexandra-Valentina Anghel; Ionut-Lucian Antone-Iordache; Simona Coniac; Andreea Iuliana Ionescu

TL;DR

MedQARo addresses the lack of Romanian medical QA benchmarks by introducing a large-scale, clinically grounded dataset of 105,880 QA pairs drawn from 1,242 oncology epicrises. The work evaluates diverse LLM configurations, showing that domain- and language-specific fine-tuning substantially boosts performance, while zero-shot approaches struggle, especially under cross-domain shifts. Key contributions include the dataset with patient-level splits to prevent leakage, a comprehensive multi-model benchmark (including Romanian-adapted, long-context, and biomedical LLMs), and a detailed analysis of prompt formats and context length. The findings highlight the challenges of clinical QA in low-resource languages and underscore the potential of fine-tuned models, motivating retrieval-augmented strategies for improved generalization in Romanian medical NLP.

Abstract

Question answering (QA) is an actively studied topic and a core natural language processing (NLP) task that must be addressed on the path to Artificial General Intelligence (AGI). However, the lack of QA datasets for specific domains and languages hinders the development of robust AI models that generalize across domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art (SOTA) large language models (LLMs). We construct a high-quality, large-scale dataset comprising 105,880 QA pairs related to cancer patients from two medical centers. The questions concern the medical case summaries of 1,242 patients and require either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 3,000 work hours generating the QA pairs. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct model families on MedQARo. Each model is evaluated in two scenarios: zero-shot prompting and supervised fine-tuning. We also evaluate two state-of-the-art LLMs available only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
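
The TL;DR notes that the dataset uses patient-level splits to prevent leakage. As an illustration only, and not the authors' actual procedure, a minimal Python sketch of such a split is given below. The `patient_id` field, the function name `patient_level_split`, and the `test_fraction` and `seed` parameters are all assumptions made for this example; the released dataset may use a different schema and a different split procedure.

```python
import random
from collections import defaultdict


def patient_level_split(qa_pairs, test_fraction=0.2, seed=0):
    """Split QA pairs so that no patient appears in both train and test.

    `qa_pairs` is a list of dicts with a hypothetical "patient_id" key.
    """
    # Group QA pairs by patient, since many pairs share one case summary.
    by_patient = defaultdict(list)
    for pair in qa_pairs:
        by_patient[pair["patient_id"]].append(pair)

    # Shuffle patients (not individual QA pairs) with a fixed seed.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Hold out a fraction of patients; keep at least one test patient.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [p for pid in patient_ids if pid not in test_ids
             for p in by_patient[pid]]
    test = [p for pid in patient_ids if pid in test_ids
            for p in by_patient[pid]]
    return train, test
```

Splitting on patients rather than on QA pairs means that all questions about one case summary land on the same side of the split, so a model cannot score well on the test set simply by having seen the same summary during fine-tuning.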
