MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants

Hritik Bansal, Daniel Israel, Siyan Zhao, Shufan Li, Tung Nguyen, Aditya Grover

TL;DR

MedMax addresses the lack of large-scale, diverse multimodal biomedical instruction data with a 1.47M-instance dataset that spans radiology and histopathology and includes novel interleaved image-text content. The authors fine-tune a mixed-modal foundation model using LoRA on Anole-7B and report substantial gains over Chameleon and GPT-4o across 12 biomedical VQA tasks; they also evaluate the model on captioning, generation, and visual chat tasks. A unified evaluation suite standardizes assessment across modalities and tasks. The work emphasizes scalable data curation, diverse skill coverage, and ablation studies, positioning MedMax as a foundation for domain-specific multimodal biomedical AI with potential use in clinical support and research.

Abstract

Recent advancements in mixed-modal generative models have opened new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and generating multimodal patient reports. However, existing datasets face challenges such as small sizes, limited coverage of biomedical tasks and domains, and a reliance on narrow sources. To address these gaps, we present MedMax, a large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including interleaved image-text generation, biomedical image captioning and generation, visual chat, and report understanding. These tasks span knowledge across diverse biomedical domains, including radiology and histopathology, grounded in medical papers and YouTube videos. Subsequently, we fine-tune a mixed-modal foundation model on the MedMax dataset, achieving significant performance improvements: a 26% gain over the Chameleon model and an 18.3% improvement over GPT-4o across 12 downstream biomedical visual question-answering tasks. Finally, we introduce a unified evaluation suite for biomedical tasks to guide the development of mixed-modal biomedical AI assistants. The data, model, and code are available at https://mint-medmax.github.io/.

Paper Structure

This paper contains 49 sections, 2 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Examples of diverse multimodal biomedical tasks covered in the MedMax dataset. The model inputs (yellow boxes) and corresponding outputs (red boxes) illustrate various task types: multimodal generation with interleaved text and images, medical report generation, text-to-image generation, visual question answering, medical image analysis through visual chat, and image captioning. Note that report-conditioned image generation, which falls under report understanding, is not shown here.
  • Figure 2: A mixed-modal foundation model is capable of understanding text and image inputs and can generate both textual and visual outputs through a unified architecture.
  • Figure 3: We present the data sources used to curate task-specific data in the MedMax collection.
  • Figure 4: We source the data from biomedical sources that cover several domains (e.g., radiology) and knowledge bases (e.g., research papers, YouTube).
  • Figure 5: Performance on the multimodal generation task. Comparison between the MedMax and Chameleon mixed-modal models. We find that MedMax fine-tuning improves multimodal content generation capabilities in the biomedical domain.
  • ...and 7 more figures