
Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

Jonghun Kim, Sinyoung Ra, Hyunjin Park

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text description to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences carry far greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism that reuses feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment the limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.
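
As a rough illustration of the pipeline the abstract describes, here is a minimal sketch of how quantized image tokens from a VQ-style encoder could be appended to an LLM's text vocabulary and interleaved with instruction tokens. The special tokens <input>, <output>, and <seg> follow Figure 3; the vocabulary sizes and helper names (build_sequence, image_code_to_token) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: extend an LLM vocabulary with VQ codebook indices and build a
# multimodal token sequence. All sizes and names are assumptions.
from typing import List

TEXT_VOCAB_SIZE = 32000   # assumed base LLM vocabulary size
CODEBOOK_SIZE = 8192      # assumed VQ codebook size
SPECIAL = {
    "<input>": TEXT_VOCAB_SIZE + CODEBOOK_SIZE,
    "<output>": TEXT_VOCAB_SIZE + CODEBOOK_SIZE + 1,
    "<seg>": TEXT_VOCAB_SIZE + CODEBOOK_SIZE + 2,
}

def image_code_to_token(code: int) -> int:
    """Map a VQ codebook index to an ID in the extended LLM vocabulary."""
    assert 0 <= code < CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def build_sequence(instruction_ids: List[int],
                   image_codes: List[int],
                   task: str) -> List[int]:
    """Concatenate instruction tokens, the <input>-tagged image tokens, and the
    task tag (<seg> for segmentation, <output> for translation)."""
    seq = list(instruction_ids)
    seq.append(SPECIAL["<input>"])
    seq.extend(image_code_to_token(c) for c in image_codes)
    seq.append(SPECIAL["<seg>"] if task == "segmentation" else SPECIAL["<output>"])
    return seq

# Usage: a toy instruction and a 16x16 grid of VQ indices flattened to 256 codes.
if __name__ == "__main__":
    toy_instruction = [101, 2054, 2003]   # placeholder text-token IDs
    toy_codes = list(range(256))          # placeholder VQ indices
    seq = build_sequence(toy_instruction, toy_codes, task="translation")
    print(len(seq), seq[:5], seq[-1])
```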

Paper Structure

This paper contains 14 sections, 7 equations, 16 figures, and 4 tables.

Figures (16)

  • Figure 1: Example of LLaBIT performing versatile tasks on brain MR images. LLaBIT supports report generation and image-to-image tasks.
  • Figure 2: Text data generation with LLMs on a dataset with only images. Images and captions are processed by LLMs with strict predefined instructions and few-shot samples selected by clinicians to generate reports and VQA results. The output of each model is accepted or rejected by GPT-4o, and the final report and VQA are regenerated based on this feedback.
  • Figure 3: Instruction tuning pipeline. Both text and images are tokenized and fed into the LLM, which can generate either text or image tokens as output. The LLM's vocabulary is extended to include image tokens in addition to text tokens. The instruction is provided to the LLM as text tokens. The image is converted into quantized tokens using a VQ encoder and fed into the LLM along with an <input> token. These image tokens are added to the LLM's vocabulary as discrete values, similar to text tokens. For image translation, the output is generated with an <output> token, while for segmentation, the output is generated with a <seg> token.
  • Figure 4: Fine-tuning of VQ-GAN with zero skip connection. The skip connection is fine-tuned while the image encoder and decoder are frozen. A zero convolution block is adopted, together with a BiomedCLIP text encoder and prompt tuning, to flexibly adapt to the target (a minimal code sketch of the zero-convolution skip follows this list).
  • Figure 5: Loss functions for image-to-image tasks. (a) The translation task is trained using reconstruction loss. (b) The segmentation task is trained using Dice loss with an additional layer.
  • ...and 11 more figures
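
For the zero skip connection described in Figure 4, the following is a minimal sketch, assuming a 1x1 convolution initialized to zero is placed on the path that reinjects encoder feature maps into the frozen decoder. The class name ZeroConvSkip, channel count, and tensor shapes are hypothetical, not the paper's exact layers.

```python
# Sketch of a zero-initialized convolutional skip connection (ControlNet-style),
# assumed here as the mechanism behind Figure 4's "zero skip connection".
import torch
import torch.nn as nn

class ZeroConvSkip(nn.Module):
    """1x1 convolution initialized to zero, so the skip path starts as a no-op
    and gradually learns to pass encoder features during fine-tuning."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
        # At initialization the output equals decoder_feat, leaving the frozen
        # decoder's behavior unchanged; training opens the skip path.
        return decoder_feat + self.conv(encoder_feat)

# Usage with toy feature maps.
skip = ZeroConvSkip(channels=64)
dec = torch.randn(1, 64, 32, 32)
enc = torch.randn(1, 64, 32, 32)
out = skip(dec, enc)
assert torch.allclose(out, dec)  # zero-initialized conv contributes nothing yet
```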