D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions

Hareem Nisar, Syed Muhammad Anwar, Zhifan Jiang, Abhijeet Parida, Ramon Sanchez-Jacob, Vishwesh Nath, Holger R. Roth, Marius George Linguraru

TL;DR

D-Rax addresses the risk of hallucinations and imprecision in general vision-language medical models by introducing a domain-specific radiologic assistant for chest X-ray analysis. It builds on LLaVA-Med, enriching instruction-following data with predictions from expert models (e.g., DenseNet-based disease diagnoses, age, race, and view) derived from MIMIC-CXR and Medical-Diff-VQA, and trains a domain-focused VLM with a ViT-Large/CLIP visual encoder and a trainable projection on top of a Llama2-7B backbone. Empirical results show statistically significant improvements in abnormality and presence questions for open- and closed-ended queries, with ablations indicating robustness on extended test sets; comparisons with expert models reveal that incorporating expert predictions enhances diagnostic reasoning beyond what generic VLMs offer. The work demonstrates that domain-specific expert-guided training can reduce hallucinations and improve precision in radiologic interpretations, offering a scalable approach to augment radiology reporting and decision-making and potentially extending to other medical imaging domains.

Abstract

Large vision language models (VLMs) have progressed rapidly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis, which currently hinders the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax -- a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnoses. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising images, instructions, and disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated on both open- and closed-ended conversations. By combining state-of-the-art diagnostic models with VLMs, D-Rax lets clinicians interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.

Paper Structure (17 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Overview of the design of our expert vision language model, D-Rax. The training data is multimodal, combining visual information (chest X-ray images) with textual information (VQA pairs from radiology reports and expert model predictions).
  • Figure 2: Qualitative evaluation: conversations provided by VLMs trained on basic and expert-enhanced data. The red arrow shows the area of the pleural effusion and the yellow arrows outline the lateral margins of the enlarged heart (cardiomegaly) provided by the radiologist, which were correctly identified by D-Rax.
  • Figure 3: Data organization for expert-enhanced training, containing the following information: (1) image identifiers, (2) question-answer pairs, (3) diagnostic predictions on 18 medical conditions, (4) predicted age of the patient, (5) predicted race of the patient, and (6) predicted view of the CXR.
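
The six items listed for Figure 3 can be pictured as one record per image. The following is a minimal sketch only, not the authors' actual schema: the class name, field names, the `expert_context` helper, the 0.5 threshold, and all values in the example are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ExpertEnhancedRecord:
    """Hypothetical container mirroring the six items listed for Figure 3."""
    image_id: str                       # (1) image identifier
    qa_pairs: List[Tuple[str, str]]     # (2) question-answer pairs
    diagnosis_probs: Dict[str, float]   # (3) expert scores per medical condition
    predicted_age: float                # (4) predicted age of the patient
    predicted_race: str                 # (5) predicted race of the patient
    predicted_view: str                 # (6) predicted view of the CXR


def expert_context(record: ExpertEnhancedRecord, threshold: float = 0.5) -> str:
    """Flatten the expert predictions into one text line.

    The threshold and the output wording are assumptions for this sketch;
    the source does not specify how predictions are rendered as text.
    """
    positives = sorted(
        name for name, prob in record.diagnosis_probs.items() if prob >= threshold
    )
    findings = ", ".join(positives) if positives else "none above threshold"
    return (
        f"Expert predictions: findings: {findings}; "
        f"age: {record.predicted_age:.0f}; "
        f"race: {record.predicted_race}; "
        f"view: {record.predicted_view}"
    )


# Placeholder values only; none of these come from MIMIC-CXR or the paper.
example = ExpertEnhancedRecord(
    image_id="example-0001",
    qa_pairs=[("Is there pleural effusion?", "Yes")],
    diagnosis_probs={"Effusion": 0.91, "Cardiomegaly": 0.78, "Pneumonia": 0.12},
    predicted_age=63.4,
    predicted_race="placeholder",
    predicted_view="PA",
)
print(expert_context(example))
```

Keeping the expert outputs in structured fields and rendering them to text only when a prompt is assembled is one possible design choice; it makes it easy to change the wording or threshold without touching the stored predictions.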