MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation
Ling Yang, Zhanyu Wang, Zhenghao Chen, Xinyu Liang, Luping Zhou
TL;DR
This work addresses the lack of unified multimodal large language models for medical imaging by introducing MedXChat, a framework that jointly handles chest X-ray (CXR) understanding and generation. It combines a CLIP-based visual encoder with an instruction-tuned LLM (adapted via LoRA) and a diffusion-based CXR generator (CXR-SD), trained on an instruction dataset constructed by applying ChatGPT-4 prompts to MIMIC-CXR data. MedXChat supports three tasks: CXR-to-Report, CXR-VQA, and Text-to-CXR. Across these tasks it shows strong performance, producing clinically meaningful reports and answers as well as high-fidelity CXR images, including lateral views. The authors plan to release their instruction data and the CXR-SD-tuned model to support reproducibility and future research, while acknowledging current limitations in medical visual grounding.
Abstract
Multimodal Large Language Models (MLLMs) have shown success in various general image processing tasks, yet their application in medical imaging is nascent and lacks tailored models. This study investigates the potential of MLLMs in improving the understanding and generation of Chest X-Rays (CXRs). We introduce MedXChat, a unified framework facilitating seamless interactions between medical assistants and users for diverse CXR tasks, including text report generation, visual question-answering (VQA), and Text-to-CXR generation. By using natural language as the input, our MLLM breaks task boundaries and simplifies medical professional training by allowing diverse tasks within a single environment. For CXR understanding, we leverage powerful off-the-shelf visual encoders (e.g., ViT) and LLMs (e.g., mPLUG-Owl) to convert medical imagery into language-like features, and subsequently fine-tune our large pre-trained models for medical applications using a visual adapter network and a delta-tuning approach. For CXR generation, we introduce a synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework and requires no extra parameters, thereby maintaining SD's generative strength while also giving it the capacity to render fine-grained medical images with high fidelity. Through comprehensive experiments, our model demonstrates strong cross-task adaptability, performing well across all three defined tasks. Our MedXChat model and the instruction dataset utilized in this research will be made publicly available to encourage further exploration in the field.
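The delta-tuning step mentioned above (identified in the TL;DR as LoRA) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LoRALinear`, the helper `matvec`, the initialization scale, and the toy dimensions are all assumptions made for illustration. The idea shown is the general LoRA one: the pre-trained weight stays frozen, and only a low-rank pair of matrices `A` and `B` is trained, with `B` starting at zero so the adapted layer initially matches the pre-trained one.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer with a low-rank adapter (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, shape (d_out, d_in)
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A is small random, B is zero, so the initial update B @ A is zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))   # adapter path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B would receive gradient updates during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear(weight=[[1, 0], [0, 1], [1, 1]], rank=1)
print(layer.forward([2, 3]))      # equals the frozen layer's output before training
print(layer.trainable_params())   # number of adapter parameters
```

In this toy case the frozen weight has 6 entries while the adapter has 5, so nothing is saved; the benefit appears only at realistic sizes, where the rank is far smaller than the layer dimensions.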
