NurseLLM: The First Specialized Language Model for Nursing

Md Tawkat Islam Khondaker, Julia Harrington, Shady Shehata

TL;DR

NurseLLM is a nursing-specific large language model for NCLEX-style multiple-choice question answering (MCQ), addressing a gap left by general medical LLMs. The authors build a large-scale NCLEX-equivalent QA dataset (125K samples) through a multi-stage data-generation pipeline, and pair it with three nursing benchmarks (NCLEX-Test, GPT4o-Test, MultiNurseQA) and a decontamination protocol to protect evaluation integrity. The model is fine-tuned from a medical-specialized base model (Med42) using QLoRA with 4-bit quantization, then merged back via MergeKit to preserve general medical knowledge. NurseLLM outperforms open-source baselines on nursing MCQs and approaches state-of-the-art in reasoning-enabled setups, while remaining competitive on generic medical benchmarks. The work also explores the potential of reasoning-focused data and multi-agent collaboration systems (MAS) in nursing AI, and discusses data quality, ethics, and deployment considerations for domain-specific LLMs. Overall, the results underscore the value of domain specialization in nursing AI and point to future directions in reasoning-driven supervision and collaborative agent frameworks for clinical education and decision support.
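
The summary above mentions a decontamination protocol but does not spell out its mechanics. As a rough illustration only, the sketch below shows one common way to approach this, word n-gram overlap filtering between training and test questions. The function names (`ngrams`, `decontaminate`), the 8-gram window, and the sample questions are all hypothetical and are not taken from the paper; the authors' actual protocol may differ.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`.

    Texts with fewer than `n` words yield an empty set, so they are
    never flagged by this simple check.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_questions, test_questions, n=8):
    """Drop training questions that share any word n-gram with a test question.

    Assumption: an n-gram overlap is treated as contamination. This is an
    illustrative heuristic, not the paper's documented procedure.
    """
    test_grams = set()
    for question in test_questions:
        test_grams |= ngrams(question, n)
    return [q for q in train_questions if not (ngrams(q, n) & test_grams)]


# Hypothetical sample data, invented for this example.
train = [
    "A nurse is caring for a client with heart failure who reports dyspnea at night",
    "Which action should the nurse take first for a client with hypoglycemia",
]
test = ["The nurse is caring for a client with heart failure who reports dyspnea"]

kept = decontaminate(train, test)
print(kept)
```

In this toy run the first training question shares a long word sequence with the test question and is dropped, while the second is kept. A smaller `n` makes the filter stricter at the cost of more false positives.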

Abstract

Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple-choice question answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large-scale nursing MCQ dataset for training LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.

Paper Structure

This paper contains 31 sections, 10 figures, 29 tables.

Figures (10)

  • Figure 1: Overall methodology of NurseLLM data collection and training.
  • Figure 2: Example of a sample taxonomy (Specialization → Domain → Topic → Concept) for the specialization of Clinical Skills and Knowledge.
  • Figure 3: Prompt template used for synthetic data generation.
  • Figure 4: Performance of the models on the NCLEX-Test Benchmark.
  • Figure 5: Performance of the models on GPT4o-Test Benchmark.
  • ...and 5 more figures