Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases

Valentina Carbonari, Pierangelo Veltri, Pietro Hiram Guzzi

TL;DR

This survey addresses the challenge of diagnosing rare diseases by examining how large language models (LLMs) are applied to biomedical text, case reports, and related knowledge sources. It presents a PRISMA-guided map of the literature, classifies studies along dimensions such as disease focus, data modalities, prompting strategies, and integration within pipelines, and highlights four disease-specific case studies. The paper discusses datasets, ontologies, and corpora foundational to biomedical NLP, and analyzes challenges including data scarcity, bias, privacy, and explainability. It further outlines a roadmap toward multimodal LLMs that fuse genetic, imaging, and EHR data, supported by synthetic data augmentation, retrieval-augmented generation, and domain-specific pre-training, with an emphasis on governance and cross-disciplinary collaboration to translate advances into clinical practice.

Abstract

Recent advances in artificial intelligence, particularly large language models (LLMs), have shown promising capabilities in transforming rare disease research. This survey paper explores the integration of LLMs in the analysis of rare diseases, highlighting significant strides and pivotal studies that leverage textual data to uncover insights and patterns critical for diagnosis, treatment, and patient care. While current research predominantly employs textual data, multimodal data integration that combines genetic, imaging, and electronic health record data stands as a promising frontier. We review foundational papers that demonstrate the application of LLMs in identifying and extracting relevant medical information, simulating intelligent conversational agents for patient interaction, and enabling the formulation of accurate and timely diagnoses. Furthermore, this paper discusses the challenges and ethical considerations inherent in deploying LLMs, including data privacy, model transparency, and the need for robust, inclusive datasets. As part of this exploration, we present an experimental section that uses multiple LLMs alongside structured questionnaires designed specifically for diagnostic purposes across different diseases. We conclude with future perspectives on the evolution of LLMs towards truly multimodal platforms, which would integrate diverse data types to provide a more comprehensive understanding of rare diseases, ultimately fostering better outcomes in clinical settings.

Paper Structure

This paper contains 14 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Flowchart of the diagnostic journey for rare disease patients. The path frequently includes multiple incorrect diagnoses, specialist referrals, and redundant testing before the correct diagnosis is reached, often several years after symptom onset.
  • Figure 2: PRISMA flow diagram illustrating the systematic review process. Searches included combinations of "LLM", "Rare Diseases", "Data Analysis", and "Questionnaires" (2022-2024).