LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

Suhyeon Lee, Won Jun Kim, Jinho Chang, Jong Chul Ye

TL;DR

This work tackles vision-language alignment for medical imaging by instruction-tuning a text-only LLM to process chest X-ray content through VQ-GAN tokenization, eliminating the need for heavy architectural changes. The LLM-CXR framework expands the model's token space with image tokens, preserves clinical details with an auxiliary feature loss, and uses synthetic VQA data to enrich supervision. A two-stage, instruction-focused fine-tuning regimen enables bidirectional capabilities: CXR-to-report, report-to-CXR, and CXR-VQA, achieving strong performance across tasks with a relatively small 3B-parameter model. The approach demonstrates improved image-text alignment and generation quality, offering a practical path toward reliable multimodal radiology assistants, while acknowledging limitations in residual errors and latency that warrant further research.
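The token-space expansion mentioned above can be illustrated with a small sketch. It assumes that each VQ-GAN codebook index is mapped to a new token id placed directly after the original text vocabulary, so one token sequence can carry both text and image content. The vocabulary and codebook sizes, the function names, and the offset scheme are all hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
from typing import List, Sequence

# Illustrative sizes only; these are assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 50_000  # size of the original text-only vocabulary
CODEBOOK_SIZE = 1_024     # number of entries in the VQ-GAN codebook


def codes_to_token_ids(codes: Sequence[int]) -> List[int]:
    """Map VQ-GAN codebook indices to ids in the extended vocabulary."""
    for code in codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"codebook index out of range: {code}")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def is_image_token(token_id: int) -> bool:
    """Return True if the id lies in the block reserved for image tokens."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + CODEBOOK_SIZE


def token_ids_to_codes(token_ids: Sequence[int]) -> List[int]:
    """Inverse mapping: recover codebook indices from image-token ids."""
    for token_id in token_ids:
        if not is_image_token(token_id):
            raise ValueError(f"not an image token id: {token_id}")
    return [token_id - TEXT_VOCAB_SIZE for token_id in token_ids]


def build_sequence(
    instruction_ids: Sequence[int],
    image_codes: Sequence[int],
    response_ids: Sequence[int],
) -> List[int]:
    """Concatenate instruction text, image tokens, and response text."""
    return list(instruction_ids) + codes_to_token_ids(image_codes) + list(response_ids)


if __name__ == "__main__":
    # Toy ids: two instruction tokens, three image codes, one response token.
    print(build_sequence([11, 42], [0, 7, 1023], [5]))
    # prints [11, 42, 50000, 50007, 51023, 5]
```

With a mapping of this kind, the same sequence layout could in principle serve both directions: image tokens in the prompt for report generation or VQA, and image tokens in the response for report-to-CXR generation, where the generated ids would be mapped back to codebook indices and decoded by the VQ-GAN.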

Abstract

Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual I/O. This direction of research is particularly relevant to medical imaging because medical image analysis and generation consist of reasoning based on a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs; presumably, however, to achieve the maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our model, LLM-CXR, trained with this approach, shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks. The code is at https://github.com/hyn2028/llm-cxr.

Paper Structure (29 sections, 1 equation, 8 figures, 9 tables)

This paper contains 29 sections, 1 equation, 8 figures, 9 tables.

Figures (8)

  • Figure 1: (a) Example of previous work that indirectly implements a multimodal bidirectional LLM by connecting a pretrained image encoder or image generation model to a pretrained LLM with a mapping layer. (b) Example of previous work that implements a multimodal bidirectional non-LLM transformer with VQ-GAN trained from scratch (i.e., without learned language features). (c) To enable direct multimodal feature interaction in LLMs pre-trained with text, our method implements (b) through an LLM-specific instruction fine-tuning scheme.
  • Figure 2: Examples of VQA generated from a CXR text report.
  • Figure 3: Examples of text report generation for a given CXR image with LLM-CXR. While the generated reports use different wording from the ground-truth reports, LLM-CXR is able to generate reports that capture the gist of the contents of the CXR, demonstrating alignment of vision-language features within the model. In addition, similar to real CXR reports, LLM-CXR often proposes valid causes for certain findings (e.g., suggesting aspiration as the cause of consolidation), demonstrating language-based reasoning ability characteristic of LLMs.
  • Figure 4: Examples of VQA with LLM-CXR. LLM-CXR understands questions given in natural language and is able to answer with relevant findings.
  • Figure 5: CXR images generated with LLM-CXR using radiology reports as input. (a) Normal CXRs. (b) Words such as "severe" and "mild" allow for the generation of different severities of lesions. (c) Specification of the location of lesions using words such as "left", "right", and "bilateral".
  • ...and 3 more figures