Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine
Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, Yehui Yang
TL;DR
MedPLIB tackles pixel-level grounding in biomedicine with a three-component architecture (encoder, Mixture-of-Experts LLM, decoder) and a multi-stage MoE training strategy that first trains separate experts for visual-language and pixel-grounding tasks, then fine-tunes them jointly. It extends inputs beyond text to pixel-level prompts (points, bounding boxes, and free-form shapes), and expert routing keeps inference cost equal to that of a single expert model. The accompanying MeCoVQA dataset covers eight modalities with 310k QA pairs, supporting complex medical VQA, region-level understanding, and evaluation of pixel grounding. MedPLIB reports state-of-the-art results on OmniMedVQA and MeCoVQA, strong region-level performance, and zero-shot pixel-grounding gains over prior models. The authors state that code, data, and model checkpoints will be released publicly.
Abstract
In recent years, Multimodal Large Language Models (MLLMs) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, which limits both their capabilities and their flexibility of use. In this paper, we introduce MedPLIB, a novel end-to-end multimodal large language model for the biomedical domain with pixel-level understanding. It supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which first trains a visual-language expert model and a pixel-grounding expert model in separate phases and then fine-tunes them together as an MoE. This strategy effectively coordinates multitask learning while keeping the computational cost at inference equivalent to that of a single expert model. To advance research on biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which covers 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB achieves state-of-the-art results across multiple medical visual language tasks. More importantly, in zero-shot evaluations on the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6, respectively, on the mDice metric. The code, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.
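For readers unfamiliar with the mDice metric cited above, the following is a minimal sketch of the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), averaged over a set of predicted and reference masks. It is not taken from the MedPLIB code base: the function names `dice` and `mean_dice`, the nested-list mask representation, and the convention of scoring two empty masks as 1.0 are all assumptions made for illustration, and the paper's exact averaging scheme (per image, per modality, or otherwise) is not specified in the abstract.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    # Flatten both masks and coerce entries to 0 or 1.
    flat_p = [int(bool(v)) for row in pred for v in row]
    flat_t = [int(bool(v)) for row in target for v in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p & t for p, t in zip(flat_p, flat_t))
    total = sum(flat_p) + sum(flat_t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(preds, targets):
    """Unweighted mean of per-mask Dice scores (one plausible reading of mDice)."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equally sized, non-empty lists of masks")
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred_a = [[1, 1], [0, 0]]
    ref_a = [[1, 0], [0, 0]]
    pred_b = [[0, 1], [0, 1]]
    ref_b = [[0, 1], [0, 1]]
    print(mean_dice([pred_a, pred_b], [ref_a, ref_b]))
```

Dice scores are often reported as percentages, so the margins quoted in the abstract are most naturally read on a 0 to 100 scale, though the abstract itself does not state the unit.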
