Domain-Specific Pretraining of Language Models: A Comparative Study in the Medical Field
Tobias Kerner
TL;DR
The paper investigates domain-specific pretraining for medical LLMs and compares it to general-purpose and mixed-domain approaches. It argues that domain-focused data can enable smaller models to achieve competitive performance, with advantages for privacy and local deployment. The study surveys datasets, describes how to create domain-specific datasets, and presents benchmark results showing that models like BioMedLM, Apollo, and HEAL can outperform larger general models on medical tasks. Practically, the findings support using domain-specific or mixed-domain pretraining to enable local, privacy-preserving inference with modest hardware while maintaining strong medical performance.
Abstract
LLMs are often used for specific tasks within a single domain. Such tasks usually require less general knowledge and more domain-specific knowledge. Highly capable, general-purpose state-of-the-art language models like GPT-4 or Claude-3-opus can often be used for such tasks, but they are very large and could not be run locally even if they were not proprietary. This can be a problem when working with sensitive data. This paper focuses on domain-specific and mixed-domain pretraining as potentially more efficient methods than general pretraining for specialized language models. We review work related to domain-specific pretraining, specifically in the medical field, and compare benchmark results of specialized language models with those of general-purpose language models.
