Assessing and Enhancing Large Language Models in Rare Disease Question-answering

Guanchu Wang, Junhao Ran, Ruixiang Tang, Chia-Yuan Chang, Chia-Yuan Chang, Yu-Neng Chuang, Zirui Liu, Vladimir Braverman, Zhandong Liu, Xia Hu

TL;DR

This work tackles the challenge of diagnosing rare diseases with large language models by introducing ReDis-QA, a dataset of 1360 QA pairs across 205 rare diseases, and ReCOP, the first open-source corpus constructed from the NORD reports to support retrieval-augmented generation. The authors benchmark open-source LLMs on ReDis-QA and demonstrate significant performance gaps, particularly on complex properties like related disorders and diagnosis. By organizing ReCOP into seven disease-specific chunks (overview, symptoms, causes, effects, related disorders, diagnosis, therapies) and combining it with retrieval strategies, they achieve an average accuracy improvement of $8\%$ and improved explainability that traces back to existing literature. The open-source nature of ReDis-QA and ReCOP, along with their demonstrated effectiveness for RAG, offers a practical path toward more trustworthy, literature-grounded diagnostic support for rare diseases in clinical and research settings.

Abstract

Despite the impressive capabilities of Large Language Models (LLMs) in general medical domains, questions remain about their performance in diagnosing rare diseases. To address this, we assess the diagnostic performance of LLMs on rare diseases and explore methods to enhance their effectiveness in this area. In this work, we introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of LLMs in diagnosing rare diseases. Specifically, we collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. Additionally, we annotated metadata for each question, facilitating the extraction of subsets specific to any given disease and its property. Based on the ReDis-QA dataset, we benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. To facilitate retrieval-augmented generation for rare disease diagnosis, we collect the first rare disease corpus (ReCOP), sourced from the National Organization for Rare Disorders (NORD) database. Specifically, we split the report of each rare disease into multiple chunks, each representing a different property of the disease: overview, symptoms, causes, effects, related disorders, diagnosis, and standard therapies. This structure ensures that the information within each chunk aligns consistently with a question. Experimental results demonstrate that ReCOP improves the accuracy of LLMs on the ReDis-QA dataset by an average of 8%. Moreover, it guides LLMs to generate trustworthy answers and explanations that can be traced back to existing literature.
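
The chunk-then-retrieve idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary layout, the function names `chunk_report` and `bm25_top_k`, the BM25 parameters `k1` and `b`, and the toy reports (placeholder text, not NORD content) are all assumptions made for illustration. Only the seven property names and the use of a BM25 retriever with $k=7$ come from the paper summary.

```python
import math
import re
from collections import Counter

# Property names as listed in the abstract; storing a report as a
# property -> text dictionary is an assumption of this sketch.
PROPERTIES = ["overview", "symptoms", "causes", "effects",
              "related disorders", "diagnosis", "standard therapies"]


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_report(disease, report):
    """Turn one report into one chunk per property that has text."""
    return [
        {"disease": disease, "property": prop, "text": report[prop]}
        for prop in PROPERTIES
        if report.get(prop)
    ]


def bm25_top_k(query, chunks, k=7, k1=1.5, b=0.75):
    """Score every chunk with Okapi BM25 and return the k best chunks."""
    docs = [tokenize(c["text"]) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n_docs, 1)
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (log of 1 plus the classic ratio).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))
    # Stable sort: ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Hypothetical placeholder reports, invented for this example only.
toy_reports = {
    "Disease A": {
        "symptoms": "muscle weakness and fatigue",
        "causes": "mutation in a hypothetical gene",
        "diagnosis": "confirmed by genetic testing",
    },
    "Disease B": {
        "symptoms": "skin rash and fever",
        "standard therapies": "supportive care",
    },
}
corpus = [c for name, rep in toy_reports.items() for c in chunk_report(name, rep)]
hits = bm25_top_k("What genetic testing is used for diagnosis?", corpus, k=2)
```

Because each chunk carries its disease and property, a retrieved passage can be traced back to a specific part of a specific report, which is the kind of traceability the abstract attributes to ReCOP. The retrieved chunks would then be prepended to the question in the LLM prompt; that step is omitted here.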

Paper Structure

This paper contains 22 sections, 10 figures, 4 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Pipeline of building the rare disease QA (ReDis-QA) dataset: data collection, cleaning, and labeling.
  • Figure 2: (a) Top-50 rare diseases in the ReDis-QA dataset. (b) Ratios of questions corresponding to the symptoms, causes, effects, related disorders, diagnosis, and other properties in the ReDis-QA dataset. (c) Benchmark results of LLMs on the ReDis-QA dataset, with accuracy for each subset of properties displayed separately.
  • Figure 3: Pipeline of building the rare disease corpus (ReCOP): data collection and chunking.
  • Figure 4: (a)-(d) Accuracy of LLMs with and without ReCOP on the six subsets of disease properties. (e)-(h) Accuracy of LLMs without ReCOP, LLMs with RAG using a baseline corpus, and LLMs with RAG using a baseline corpus and ReCOP, where the baseline corpus is Textbooks (e), StatPearls (f), PubMed (g), or Wikipedia (h); the retriever is BM25 with $k=7$.
  • Figure 5: Explanations provided by LLMs for the answers.
  • ...and 5 more figures
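
Figures 2(c) and 4 report accuracy separately for each property subset, which the per-question metadata makes possible. A minimal sketch of that grouping follows; the record fields `property`, `answer`, and `prediction` and the function name `accuracy_by_property` are hypothetical, and the toy records are made up rather than drawn from ReDis-QA.

```python
from collections import defaultdict


def accuracy_by_property(records):
    """Return {property: fraction of records whose prediction matches the answer}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["property"]] += 1
        correct[rec["property"]] += int(rec["prediction"] == rec["answer"])
    return {prop: correct[prop] / totals[prop] for prop in totals}


# Made-up records for illustration; real labels would come from the dataset metadata.
toy_records = [
    {"property": "symptoms", "answer": "A", "prediction": "A"},
    {"property": "symptoms", "answer": "C", "prediction": "B"},
    {"property": "diagnosis", "answer": "D", "prediction": "D"},
]
per_property = accuracy_by_property(toy_records)
```

Running the same function on predictions made with and without retrieved context would give the paired per-property comparison that Figure 4 describes.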