Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli

TL;DR

This work addresses the scarcity of data for Persian medical NLP by constructing open resources (a 90-million-token Persian medical corpus and the 20k-pair MF3QA Q&A dataset) and fine-tuning a small Persian LLM (aya-expanse-8b) into Gaokerena-V using LoRA. The two-stage training (fine-tuning followed by instruction-tuning) yields a model that passes the Iranian Basic Medical Sciences Entrance Exam and improves Persian MMLU performance, outperforming general-purpose baselines and avoiding the latency of translation-based pipelines. The study demonstrates that open online medical data can meaningfully enhance small Persian models for healthcare tasks in resource-constrained settings, with potential extensions to multimodal inputs and clinical validation. This has practical implications for privacy-preserving, on-device medical AI in Persian-speaking contexts and lays a foundation for future RLHF and multimodal work.

Abstract

The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and passed the September 2023 Iranian Basic Medical Science Entrance Exam (IBSEE), which the baseline model did not. Additionally, the fine-tuned model improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.

Paper Structure

This paper contains 15 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Percentage of content from each magazine crawled for the corpus
  • Figure 2: Percentage of content from each forum crawled for the MF3QA dataset
  • Figure 3: Training loss curve during the fine-tuning stage.
  • Figure 4: Training loss curve during the instruction-tuning stage.
  • Figure 5: Our model's win rate against general-purpose language models
  • ...and 1 more figure