Fine-Tuning Medical Language Models for Enhanced Long-Contextual Understanding and Domain Expertise
Qimin Yang, Rongsheng Wang, Jiexin Chen, Runqi Su, Tao Tan
TL;DR
This work investigates why the long-context understanding of medical LLMs often deteriorates after they are fine-tuned for domain knowledge, and proposes a data-centric approach that balances general and medical data during fine-tuning. Using an open-book exam framework on Chinese medical tests and diverse data sources, the study analyzes how data composition and volume affect contextual reading and instruction following. Key findings show that incorporating general data improves long-context abilities in medical LLMs, whereas excessive domain-focused data can hinder context tracking, and that adding more data yields diminishing returns beyond a saturation point. The results provide practical guidance for fine-tuning strategies that preserve broad linguistic competence while enhancing domain-specific accuracy, supporting more robust medical dialogue and decision-support applications.
Abstract
Large Language Models (LLMs) have been widely applied in various professional fields. Fine-tuning these models on domain-specific question-and-answer (Q&A) datasets significantly improves their professional domain knowledge and Q&A abilities. For example, medical LLMs fine-tuned on doctor-patient Q&A data exhibit extraordinary disease diagnostic abilities. However, we observed that despite these improvements in domain knowledge, the long-context understanding of medical LLMs declines significantly, especially compared to general language models with a similar number of parameters. The purpose of this study is to investigate this decline in long-context understanding in medical LLMs. We designed a series of experiments that administer open-book professional knowledge exams to all models in order to evaluate their ability to read long contexts. By adjusting the proportion and quantity of general data and medical data used during fine-tuning, we can determine the best data composition to optimize the professional model and achieve a balance between long-context performance and domain-specific knowledge.
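
As a rough illustration of what adjusting the proportion and quantity of general and medical data can look like in practice, the following minimal Python sketch samples a fine-tuning set of a chosen size with a chosen share of general data. The function name `mix_fine_tuning_data` and its parameters (`total_size`, `general_ratio`, `seed`) are hypothetical and introduced only for this example; this is not the procedure used in the study.

```python
import random


def mix_fine_tuning_data(general, medical, total_size, general_ratio, seed=0):
    """Sample `total_size` examples with a `general_ratio` share of general data.

    Hypothetical helper for illustration only; the names and the sampling
    scheme are assumptions, not the method used in the study.
    """
    if not 0.0 <= general_ratio <= 1.0:
        raise ValueError("general_ratio must be in [0, 1]")

    # Split the total budget between the two pools.
    n_general = round(total_size * general_ratio)
    n_medical = total_size - n_general
    if n_general > len(general) or n_medical > len(medical):
        raise ValueError("not enough examples in one of the pools")

    # Sample without replacement from each pool, then shuffle the mixture
    # so that the two sources are interleaved.
    rng = random.Random(seed)
    mixed = rng.sample(general, n_general) + rng.sample(medical, n_medical)
    rng.shuffle(mixed)
    return mixed
```

Varying `general_ratio` at a fixed `total_size` isolates the effect of data composition, while varying `total_size` at a fixed ratio isolates the effect of data volume.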
