A Generative Framework for Bidirectional Image-Report Understanding in Chest Radiography

Nicholas Evans; Stephen Baker; Miles Reed

TL;DR

This work addresses bidirectional, multimodal understanding in chest radiography by introducing MAViLT (Multi-Stage Adaptive Vision-Language Tuning), a framework that unifies vision and language within a single generative model. It combines clinical gradient-weighted VQ-GAN tokenization with a two-stage hierarchical fine-tuning regime to support three tasks: generating reports from chest X-rays (CXRs), generating CXRs from reports, and vision-based clinical question answering, all while preserving the LLM’s language capabilities. On the MIMIC-CXR and Indiana University CXR datasets, MAViLT achieves state-of-the-art performance on automatic metrics, and radiologist evaluations support its clinical relevance, indicating robust generalization and workflow efficiency. The approach offers a practical, scalable path toward integrating multimodal AI into real-world radiology pipelines and motivates future extensions to additional imaging modalities and temporal reasoning.

Abstract

The rapid advancements in large language models (LLMs) have unlocked their potential for multimodal tasks, where text and visual data are processed jointly. However, applying LLMs to medical imaging, particularly for chest X-rays (CXR), poses significant challenges due to the need for precise visual-textual alignment and the preservation of critical diagnostic details. In this paper, we propose Multi-Stage Adaptive Vision-Language Tuning (MAViLT), a novel framework designed to enhance multimodal reasoning and generation for CXR understanding. MAViLT incorporates a clinical gradient-weighted tokenization process and a hierarchical fine-tuning strategy, enabling it to generate accurate radiology reports, synthesize realistic CXRs from text, and answer vision-based clinical questions. We evaluate MAViLT on two benchmark datasets, MIMIC-CXR and Indiana University CXR, achieving state-of-the-art results across all tasks. Human evaluations further validate the clinical relevance and utility of MAViLT, making it a robust tool for real-world medical applications. This work demonstrates the feasibility of leveraging LLMs for multimodal medical imaging while addressing key challenges in vision-language integration.
