PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

Naghmeh Jamali, Milad Mohammadi, Danial Baledi, Zahra Rezvani, Hesham Faili

TL;DR

PerMedCQA addresses the lack of medical consumer QA resources for Persian by delivering the first large-scale Persian benchmark built from real-world, clinician-answered questions. It combines rigorous data cleaning (rule-based filtering and PII detection) with rich annotations (ICD-11 categories and 25 question types) and the rubric-based MedJudge evaluation, which enables nuanced assessment of open-ended answers. The work benchmarks a diverse set of multilingual LLMs, explores prompt-based enhancements (pivot translation, role-based prompting, few-shot prompting), and investigates supervised fine-tuning with LoRA on smaller models, highlighting both the potential and the limitations of current systems for Persian-speaking users. By releasing the dataset publicly and providing a clinically informed evaluation framework, PerMedCQA aims to advance trustworthy, culturally aware medical AI and to spur further research on healthcare NLP for low-resource languages.

Abstract

Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain scarce. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs using MedJudge, a novel rubric-based evaluation framework driven by an LLM grader and validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The dataset is publicly available at https://huggingface.co/datasets/NaghmehAI/PerMedCQA
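
As a quick sanity check on the cleaning figures quoted in the abstract, the short sketch below works out how many raw entries were dropped and what share was kept. Only the two counts come from the abstract; the variable names are illustrative and not part of the released dataset or code.

```python
# Counts quoted in the abstract.
raw_entries = 87_780      # raw QA entries collected from the forum
cleaned_pairs = 68_138    # QA pairs remaining after data cleaning

# Derived quantities (illustrative names, not from the paper).
removed = raw_entries - cleaned_pairs
retention_rate = cleaned_pairs / raw_entries

print(f"removed during cleaning: {removed}")
print(f"retained: {retention_rate:.1%}")
```

In other words, cleaning discarded roughly a fifth of the raw entries.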

Paper Structure

This paper contains 29 sections, 17 figures, and 3 tables.

Figures (17)

  • Figure 1: Overview of PerMedCQA
  • Figure 2: Gender Distribution in PerMedCQA
  • Figure 3: Task instructions for ICD-11 classification and PII tagging.
  • Figure 4: Standardized ICD-11 classification codes used for QA annotation.
  • Figure 5: Distribution of ICD-11 Categories in PerMedCQA
  • ...and 12 more figures