A Generative Framework for Bidirectional Image-Report Understanding in Chest Radiography
Nicholas Evans, Stephen Baker, Miles Reed
TL;DR
This work addresses bidirectional multimodal understanding in chest radiography by introducing MAViLT, a framework that unifies vision and language within a single generative model. It combines clinical gradient-weighted VQ-GAN tokenization with a two-stage hierarchical fine-tuning regime to support CXR-to-report generation, report-to-CXR generation, and vision-based clinical question answering while preserving the LLM’s language capabilities. On the MIMIC-CXR and Indiana University CXR datasets, MAViLT achieves state-of-the-art performance on automatic metrics and is judged clinically relevant by radiologists, with results that indicate robust generalization and workflow efficiency. The approach offers a practical, scalable path toward integrating multimodal AI into real-world radiology pipelines and motivates future extensions to additional imaging modalities and temporal reasoning.
Abstract
Rapid advances in large language models (LLMs) have unlocked their potential for multimodal tasks, in which text and visual data are processed jointly. However, applying LLMs to medical imaging, particularly chest X-rays (CXRs), is challenging because it requires precise visual-textual alignment and the preservation of critical diagnostic details. In this paper, we propose Multi-Stage Adaptive Vision-Language Tuning (MAViLT), a novel framework designed to enhance multimodal reasoning and generation for CXR understanding. MAViLT incorporates a clinical gradient-weighted tokenization process and a hierarchical fine-tuning strategy, enabling it to generate accurate radiology reports, synthesize realistic CXRs from text, and answer vision-based clinical questions. We evaluate MAViLT on two benchmark datasets, MIMIC-CXR and Indiana University CXR, and achieve state-of-the-art results across all three tasks. Human evaluations further support the clinical relevance and utility of MAViLT for real-world medical applications. This work demonstrates the feasibility of leveraging LLMs for multimodal medical imaging while addressing key challenges in vision-language integration.
