Distilling Large Language Models for Efficient Clinical Information Extraction

Karthik S. Vedula, Annika Gupta, Akshay Swaminathan, Ivan Lopez, Suhana Bedi, Nigam H. Shah

TL;DR

This work addresses the computational cost of running clinical information extraction with large LLMs by distilling their knowledge into compact BioBERT-based models trained on labels from multiple sources, including LLMs and medical ontologies. Using over 2,000 diverse clinical documents for teacher labeling, the authors evaluate 31 labeler combinations across three NER tasks and validate externally on the MedAlign dataset. The distilled models are up to 12x faster and up to 101x cheaper than state-of-the-art LLMs while achieving comparable performance on key clinical entities. For disease and medication extraction, distilled BioBERT models come close to BioBERT trained on human labels; symptom extraction remains more challenging. By releasing the distilled models and a framework for multi-teacher distillation, the work offers a practical path toward efficient, generalizable clinical NER in real-world healthcare settings.

Abstract

Large language models (LLMs) excel at clinical information extraction, but their computational demands limit practical deployment. Knowledge distillation, the process of transferring knowledge from larger to smaller models, offers a potential solution. We evaluate the performance of distilled BERT models, which are approximately 1,000 times smaller than modern LLMs, for clinical named entity recognition (NER) tasks. We leveraged state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers for medication, disease, and symptom extraction. We applied our approach to over 3,300 clinical notes spanning five publicly available datasets, comparing distilled BERT models against both their teacher labelers and BERT models fine-tuned on human labels. External validation was conducted using clinical notes from the MedAlign dataset. For disease extraction, F1 scores were 0.82 (teacher model), 0.89 (BioBERT trained on human labels), and 0.84 (BioBERT-distilled). For medication extraction, F1 scores were 0.84 (teacher model), 0.91 (BioBERT-human), and 0.87 (BioBERT-distilled). For symptom extraction, F1 scores were 0.73 (teacher model) and 0.68 (BioBERT-distilled). Distilled BERT models had faster inference (12x, 4x, and 8x faster than GPT-4o, o1-mini, and Gemini Flash, respectively) and lower costs (85x, 101x, and 2x cheaper than GPT-4o, o1-mini, and Gemini Flash, respectively). On the external validation dataset, the distilled BERT model achieved F1 scores of 0.883 (medication), 0.726 (disease), and 0.699 (symptom). Distilled BERT models were up to 101x cheaper and 12x faster than state-of-the-art LLMs while achieving similar performance on NER tasks. Distillation offers a computationally efficient and scalable alternative to large LLMs for clinical information extraction.
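
To make the distillation step concrete: training a BERT-style student on teacher labels requires turning the entity strings a teacher extracts into token-level tags. The following is a minimal sketch of one common way to do that (BIO tagging), assuming whitespace tokenization and exact, case-insensitive string matching. The abstract does not state how the authors aligned labels to tokens, and the helper name `bio_tags` and the label name `DISEASE` are hypothetical.

```python
def bio_tags(tokens, entities, label):
    """Tag tokens that match teacher-extracted entity strings with B-/I- labels.

    Assumption: entities are matched by exact, case-insensitive token sequence.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for entity in entities:
        words = entity.lower().split()
        n = len(words)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_is_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and window_is_free:
                tags[i] = f"B-{label}"          # first token of the entity
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"      # continuation tokens
    return tags


# Illustrative sentence and teacher output (not taken from the paper's data).
tokens = "Patient takes metformin for type 2 diabetes".split()
tags = bio_tags(tokens, ["type 2 diabetes"], "DISEASE")
print(list(zip(tokens, tags)))
```

The resulting token and tag pairs are the kind of supervised signal a token-classification student could be fine-tuned on; a real pipeline would also need to handle subword tokenization and overlapping entities.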

Paper Structure

This paper contains 8 sections, 1 figure, and 13 tables.

Figures (1)

  • Figure 1: Clinical documents were passed to teacher labelers—LLMs and ontologies—for medication, symptom, and disease entity recognition tasks. We selected the optimal combination of teacher labelers based on F1 score for subsequent experiments. BERT models were distilled from the teacher labels via supervised fine-tuning and performance was measured on in-distribution datasets as well as an external validation dataset.
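
The selection step in the Figure 1 caption, choosing the combination of teacher labelers with the best F1 score, can be sketched as an exhaustive search over non-empty subsets of labelers. This is an illustration under stated assumptions, not the authors' implementation: it assumes that a combination's output is the union of its members' extracted entities and that F1 is computed over sets of entity strings. The labeler names and example entities are hypothetical.

```python
from itertools import combinations


def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold entity strings."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_combination(labeler_outputs, gold):
    """Return the labeler subset whose merged output has the highest F1.

    Assumption: outputs of a combination are merged by set union.
    Ties keep the first (smallest) subset found.
    """
    names = sorted(labeler_outputs)
    best_names, best_f1 = (), -1.0
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            merged = set().union(*(labeler_outputs[name] for name in combo))
            score = f1_score(merged, gold)
            if score > best_f1:
                best_names, best_f1 = combo, score
    return best_names, best_f1


# Hypothetical labelers and entities for illustration only.
gold = {"metformin", "lisinopril", "aspirin"}
labeler_outputs = {
    "llm_a": {"metformin", "lisinopril"},
    "llm_b": {"metformin"},
    "ontology": {"aspirin", "tablet"},
}
best_names, best_f1 = best_combination(labeler_outputs, gold)
print(best_names, round(best_f1, 3))
```

With n labelers this search visits 2^n - 1 subsets, so it is only practical for a small pool of teachers. Here the ontology adds one false positive ("tablet") but recovers an entity the LLM missed, so the pair scores higher than either labeler alone.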