RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, Matthias Keicher

TL;DR

RaDialog tackles the challenge of clinically correct radiology report generation combined with interactive dialog by introducing a dual-branch LVLM that fuses image features and explicit structured findings with a tuned LLM via LoRA. A semi-automatic image-grounded instruct dataset (~580k samples across ten tasks) enables domain-specific dialog capabilities while mitigating catastrophic forgetting through replay and context dropping. The model demonstrates state-of-the-art clinical correctness in report generation and strong performance across interactive tasks such as report correction and findings QA, with radiologists preferring RaDialog over baselines. These results suggest RaDialog as a viable foundation for clinical radiology dialog systems, offering faster inference, robust multi-task capabilities, and a public dataset to spur further research and adoption.

Abstract

Conversational AI tools that can generate and discuss clinically correct radiology reports for a given medical image have the potential to transform radiology. Such a human-in-the-loop radiology assistant could facilitate a collaborative diagnostic process, thus saving time and improving the quality of reports. Towards this goal, we introduce RaDialog, the first thoroughly evaluated and publicly available large vision-language model for radiology report generation and interactive dialog. RaDialog effectively integrates visual image features and structured pathology findings with a large language model (LLM) while simultaneously adapting it to a specialized domain using parameter-efficient fine-tuning. To keep the conversational abilities of the underlying LLM, we propose a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray radiology tasks. By training with this dataset, our method achieves state-of-the-art clinical correctness in report generation and shows impressive abilities in interactive tasks such as correcting reports and answering questions, serving as a foundational step toward clinical dialog systems. Our code is available on GitHub: https://github.com/ChantalMP/RaDialog.

Paper Structure

This paper contains 20 sections, 1 equation, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Pipeline overview: The Image Encoder extracts X-ray features and transforms them via adapter module a or b. The Structured Findings Extractor extracts high-level findings. Both outputs are integrated during Prompt Construction with conversation history and task-specific instructions to query the LLM. The predicted answer is added to the conversation history.
  • Figure 2: Qualitative report generation results of RaDialog-project (top) and RaDialog-align (bottom). Colors indicate matching findings in ground truth and prediction.
  • Figure 3: Qualitative conversation examples with RaDialog-project-ins (left) and RaDialog-align-ins (right), showing examples of correction, knowledge QA (zero-shot), easy language, and translation (zero-shot).
  • Figure 4: Qualitative report generation comparison of RaDialog with XrayGPT and GPT4-Vision.
  • Figure 5: Differences in conversation behavior of RaDialog-align-instruct and RaDialog-project-instruct in zero-shot conversational tasks.
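
The prompt-construction loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the `<IMG>` placeholder, and the names `Conversation`, `build_prompt`, `ask`, and `fake_llm` are all hypothetical, and the LLM is replaced by a stub so the sketch runs without model weights.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Conversation:
    """Running dialog state; each turn is a (role, text) pair."""
    turns: list[tuple[str, str]] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))


def build_prompt(image_placeholder: str, findings: list[str],
                 conversation: Conversation, instruction: str) -> str:
    """Combine image slot, structured findings, history, and instruction.

    The template below is an assumption made for illustration only.
    """
    findings_text = ", ".join(findings) if findings else "no findings"
    lines = [
        f"Image information: {image_placeholder}",
        f"Predicted findings: {findings_text}",
    ]
    # Earlier turns come first so the model sees the dialog in order.
    lines += [f"{role}: {text}" for role, text in conversation.turns]
    lines.append(f"User: {instruction}")
    lines.append("Assistant:")
    return "\n".join(lines)


def fake_llm(prompt: str) -> str:
    """Stand-in for the fine-tuned LLM (hypothetical stub)."""
    return "Placeholder report text."


def ask(image_placeholder: str, findings: list[str],
        conversation: Conversation, instruction: str,
        llm: Callable[[str], str]) -> str:
    """Query the LLM once and append the exchange to the history."""
    prompt = build_prompt(image_placeholder, findings, conversation, instruction)
    answer = llm(prompt)
    conversation.add("User", instruction)
    conversation.add("Assistant", answer)
    return answer


if __name__ == "__main__":
    conv = Conversation()
    ask("<IMG>", ["Pleural Effusion"], conv, "Write a report.", fake_llm)
    # The follow-up prompt now contains the first exchange as history.
    print(build_prompt("<IMG>", ["Pleural Effusion"], conv,
                       "Is there a pneumothorax?"))
```

In a real system, the image placeholder would presumably be replaced by the adapted image features rather than a text token; the sketch only shows how history and findings accumulate in the textual prompt.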