Table of Contents
Fetching ...

Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

Karthik Soman, Andrew Langdon, Catalina Villouta, Chinmay Agrawal, Lashaw Salta, Braian Peetoom, Gianmarco Bellucci, Orion J Buske

TL;DR

Zebra-Llama is presented, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as a case study, which demonstrates unprecedented capabilities in handling EDS-related queries.

Abstract

Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information underscoring the need for focused training on these 'zebra' cases. We present Zebra-Llama, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama's substantial improvements over base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.

Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

TL;DR

Zebra-Llama is presented, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as a case study, which demonstrates unprecedented capabilities in handling EDS-related queries.

Abstract

Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information underscoring the need for focused training on these 'zebra' cases. We present Zebra-Llama, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama's substantial improvements over base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.

Paper Structure

This paper contains 24 sections, 5 figures.

Figures (5)

  • Figure 1: Fig 1: Comparison of different approaches to EDS-related query handling. (A) Base Llama model generating answers without RAG context, resulting in potentially inaccurate information and hallucinated citations. (B) Base Llama model with RAG implementation, showing imprecise utilization of retrieved context and inclusion of irrelevant information and citations. (C) Zebra-Llama model demonstrating enhanced context-aware RAG capabilities, generating precise responses with accurate citations derived specifically from relevant portions of the retrieved context. The color-coding indicates the relevance of retrieved and generated information (green: relevant, red: non-relevant), highlighting Zebra-Llama's improved ability to focus on pertinent information while generating responses.
  • Figure 2: Fig 2: Training and Inference Phases of Zebra-Llama (A) Training Phase: Data from PubMed, Inspire, and Reddit undergoes transformation into structured (Q, C, A) format. The data transformation process is shown in the insight. This processed data is then used for context-aware fine-tuning of Llama-3.1-8B-Instruct model using LoRA. (B) Inference Phase: A user prompt (Q) triggers retrieval of semantically similar documents from the Pinecone vector database, forming the context (C). The fine-tuned Llama 3.1 model then generates the output (A) by processing the concatenated user prompt and retrieved context.
  • Figure 3: Fig 3: EDS domain specificity evaluation through similarity score distribution and classification performance (A) Distribution of similarity scores for EDS-related (orange) and non-EDS (blue) questions, demonstrating distinct semantic patterns with example queries and responses. EDS questions cluster around higher similarity scores (0.85 ± 0.02), while non-EDS questions show a broader distribution (0.79 ± 0.05). (B) Precision-Recall curve for the EDS semantic classifier, achieving an optimal threshold of 0.81 with high recall (0.98) and precision (0.74). The classifier substantially outperforms the no-skill baseline (AP = 0.86), indicating robust discrimination between EDS and non-EDS queries.
  • Figure 4: Fig 4: Comprehensive evaluation of Zebra-Llama's performance (A) Expert manual evaluation comparing performance metrics between Zebra-Llama and base-Llama, showing improvements in thoroughness (77.5% vs 70.1%), accuracy (83.0% vs 78.8%), and clarity (74.7% vs 72.0%). Error bars indicate s.e.m (B) Correlation analysis between manual expert evaluation and automated GPT-4 assessment, demonstrating moderate agreement (agreement is quantified using Intraclass Correlation Coefficient-ICC and "r" denotes Pearson correlation coefficient) across all metrics (Thoroughness: ICC = 0.675, r = 0.687; Accuracy: ICC = 0.576, r = 0.581; Clarity: ICC = 0.608, r = 0.610). (C) Per-response citation average accuracy comparison, showing Zebra-Llama's superior performance (70.4% ± 5.4%) compared to base-Llama (52.3% ± 5.9%). Error bars indicate s.e.m (D) Percentage of responses with all correct citations, with Zebra-Llama (68.2%) outperforming base-Llama (51.4%), indicating improved citation reliability.
  • Figure 5: Citation performance validation on unseen RAG contexts (A) Per-response citation accuracy comparison between Zebra-Llama (82.1% ± 9.6%) and base-Llama (75.0% ± 9.8%) on test questions with entirely unseen contexts (metric is given as mean ± sem) (B) Percentage of responses with all citations correct, showing Zebra-Llama (78.6%) maintaining superior performance over base-Llama (64.3%) when evaluated on novel contexts. These results validate that Zebra-Llama's enhanced citation capabilities persist even when handling previously unseen RAG context.