Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model
Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli
TL;DR
This work addresses data scarcity in Persian medical NLP by constructing open resources (a 90-million-token Persian medical corpus and the 20k-pair MF3QA Q&A dataset) and by using LoRA to fine-tune a small Persian LLM (aya-expanse-8b) into Gaokerena-V. The two-stage training (fine-tuning, then instruction tuning) yields a model that passes the Iranian Basic Medical Science Entrance Exam and improves Persian MMLU performance, outperforming general-purpose baselines and avoiding the latency of translation pipelines. The study demonstrates that open online medical data can meaningfully enhance small Persian models for healthcare tasks in resource-constrained settings, with possible extensions to multimodal inputs and clinical validation. This has practical implications for privacy-preserving, on-device medical AI in Persian-speaking contexts and lays a foundation for future RLHF and multimodal work.
Abstract
The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages such as Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q\&A pairs and 60\% of a 90-million-token corpus crawled from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and passes the September 2023 Iranian Basic Medical Science Entrance Exam (IBSEE), which the baseline model did not. Additionally, the fine-tuned model improved accuracy on Persian-translated MMLU by an average of 2.67\%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.
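To illustrate why a parameter-efficient approach such as LoRA suits resource-constrained settings, the following minimal sketch counts the trainable parameters that a low-rank adapter adds to a single weight matrix. The matrix size and rank used here are hypothetical values chosen for illustration only; they are not the configuration reported for Gaokerena-V, and the function names are likewise assumptions.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA keeps the original weight W frozen and learns a low-rank update
    B @ A, where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    if d_in <= 0 or d_out <= 0 or rank <= 0:
        raise ValueError("d_in, d_out and rank must all be positive")
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters as a fraction of the full matrix's parameters."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical sizes (assumptions, not taken from the paper).
D_IN = 4096
D_OUT = 4096
RANK = 16

added = lora_param_count(D_IN, D_OUT, RANK)
fraction = trainable_fraction(D_IN, D_OUT, RANK)
print(f"adapter parameters: {added}")
print(f"fraction of full matrix: {fraction:.4%}")
```

Under these assumed sizes the adapter trains well under 1% of the parameters in the matrix it modifies, which is the property that makes adapting an 8-billion-parameter model feasible on modest hardware.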
