MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants

Hritik Bansal; Daniel Israel; Siyan Zhao; Shufan Li; Tung Nguyen; Aditya Grover

TL;DR

MedMax addresses the lack of large-scale, diverse multimodal biomedical instruction data with a 1.47M-instance dataset that spans radiology and histopathology and includes novel interleaved image-text content. The authors train a mixed-modal foundation model using LoRA on Anole-7B and demonstrate substantial gains over Chameleon and GPT-4o across 12 biomedical VQA tasks, as well as captioning, generation, and visual chat tasks. They also provide a unified evaluation suite to standardize assessment across modalities and tasks. The work emphasizes scalable data curation, diverse skill coverage, and rigorous ablations, positioning MedMax as a foundation for robust, domain-specific multimodal biomedical AI with practical potential for clinical support and research.

Abstract

Recent advancements in mixed-modal generative models have opened new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and generating multimodal patient reports. However, existing datasets face challenges such as small sizes, limited coverage of biomedical tasks and domains, and a reliance on narrow sources. To address these gaps, we present MedMax, a large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including interleaved image-text generation, biomedical image captioning and generation, visual chat, and report understanding. These tasks span knowledge across diverse biomedical domains, including radiology and histopathology, grounded in medical papers and YouTube videos. Subsequently, we fine-tune a mixed-modal foundation model on the MedMax dataset, achieving significant performance improvements: a 26% gain over the Chameleon model and an 18.3% improvement over GPT-4o across 12 downstream biomedical visual question-answering tasks. Finally, we introduce a unified evaluation suite for biomedical tasks to guide the development of mixed-modal biomedical AI assistants. The data, model, and code are available at https://mint-medmax.github.io/.
