Domain-Specific Pretraining of Language Models: A Comparative Study in the Medical Field

Tobias Kerner

TL;DR

The paper investigates domain-specific pretraining for medical LLMs and compares it to general-purpose and mixed-domain approaches. It argues that domain-focused data can enable smaller models to achieve competitive performance, with advantages for privacy and local deployment. The study surveys datasets, describes how to create domain-specific datasets, and presents benchmark results showing that models like BioMedLM, Apollo, and HEAL can outperform larger general models on medical tasks. Practically, the findings support using domain-specific or mixed-domain pretraining to enable local, privacy-preserving inference with modest hardware while maintaining strong medical performance.

Abstract

LLMs are often used for specific tasks within a single domain. Such tasks usually require less general knowledge but more domain-specific knowledge. Highly capable, general-purpose state-of-the-art language models like GPT-4 or Claude-3-opus can often be used for such tasks, but they are very large and could not be run locally even if they were not proprietary. This can be a problem when working with sensitive data. This paper focuses on domain-specific and mixed-domain pretraining as potentially more efficient methods than general pretraining for specialized language models. We review work related to domain-specific pretraining, specifically in the medical field, and compare benchmark results of specialized language models with those of general-purpose language models.

Paper Structure (16 sections, 4 figures, 2 tables)

Introduction
Pretraining
General Pretraining
Domain-Specific Pretraining
Mixed-Domain Pretraining
Datasets
General Datasets
Medical Datasets Overview
Creating Domain-Specific Datasets
Performance of specialized LLMs
Domain-Specific Pretrained Models
Mixed-Domain Pretrained
Benchmark Comparison
Comparing Model-Size to Benchmark Score
Further Resource-Optimization
...and 1 more section

Figures (4)

Figure 1: Processing of CommonCrawl Dataset for RefinedWeb, taken from NEURIPS2023_fa3ed726 and edited for simplification.
Figure 2: Comparison of models on medical benchmarks, sorted by model size. The number in brackets after each score gives the evaluation method (x-shot). An 'f' stands for finetune: the model was finetuned for this task. If the reference paper does not provide the evaluation method or only specifies 'few-shot', the brackets contain a '?'.
Figure 3: Comparison of model scores on PubMedQA and MedMCQA. Circle size is logarithmically proportional to the number of tokens the model was trained with. Number of training tokens for general models taken from brown2020languagemodelsfewshotlearners, substackGPT4Details, and touvron2023llama2openfoundation.
Figure 4: Comparison of model scores on MedMCQA with a trend line based on general pretrained model data. Circle size is logarithmically proportional to the number of tokens the model was trained with. Trend line function: 9.226*ln(24.64*x+665.24)-25.59
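
The trend line in the Figure 4 caption can be evaluated directly. The following is a minimal sketch, not the author's code: the function name `medmcqa_trend` is hypothetical, and the meaning and units of `x` are an assumption (presumably model size, given the section title "Comparing Model-Size to Benchmark Score"), since the caption does not state them.

```python
import math


def medmcqa_trend(x: float) -> float:
    """Evaluate the trend line quoted in the Figure 4 caption.

    Formula (from the caption): 9.226 * ln(24.64 * x + 665.24) - 25.59
    ASSUMPTION: x is the model size; the caption does not give its units.
    """
    return 9.226 * math.log(24.64 * x + 665.24) - 25.59


if __name__ == "__main__":
    # Hypothetical x values chosen only to illustrate the curve's shape.
    for x in (1, 7, 70):
        print(f"x = {x:>3}: predicted MedMCQA score = {medmcqa_trend(x):.2f}")
```

Because the formula is logarithmic in `x`, each additional unit of `x` adds less to the predicted score than the previous one. A model that scores above this curve at its value of `x` does better than the trend fitted to the general pretrained models.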
