Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

Xun Zhu, Fanbin Mo, Zheng Zhang, Jiaxi Wang, Yiming Shi, Ming Wu, Chuang Zhang, Miao Li, Ji Wu

TL;DR

The paper tackles a data-centric bottleneck in multi-task learning for medical generalist foundation models by proposing IMAX, an image-centric multi-annotation X-ray dataset that provides dense, multi-task annotations per image. Fine-tuning seven open-source medical MLLMs on IMAX yields multi-task average gains of 3.20% to 21.05% over the decentralized DMAX data. The authors relate these gains to optimization dynamics analyzed via the Fisher information matrix (FIM), reporting reduced spectral entropy $SE$ and a higher dominant eigenvalue ratio $\rho$ during IMAX training. To address practical data-collection constraints, they introduce a three-stage DMAX-based training strategy that uses pseudo IMAX data and achieves notable gains, indicating that the approach remains useful when high-quality IMAX data are scarce. Overall, the work highlights data construction as a critical lever for multi-task capability in medical generalist models and charts a path toward extending image-centric data principles to 3D modalities and broader clinical tasks, with resources to be released for community use.

Abstract

The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) at the data construction level. Specifically, IMAX has two key attributes: 1) High-quality data curation. It is a comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring rich multi-task representation per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in the statistical patterns exhibited by the IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the difficulty of obtaining high-quality IMAX data in practical scenarios.
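
The per-image averages quoted in the abstract (4.10 tasks and 7.46 training entries per image) describe how densely each image is annotated. The sketch below shows one way such averages could be computed from a flat list of training entries. The `(image_id, task)` record layout, the function name `per_image_stats`, and the toy task names are assumptions made for illustration; they are not taken from the IMAX release.

```python
from collections import defaultdict


def per_image_stats(entries):
    """Return (average distinct tasks per image, average entries per image).

    `entries` is an iterable of (image_id, task) pairs, one pair per
    training entry. This record layout is a hypothetical simplification.
    """
    tasks_by_image = defaultdict(set)
    entries_by_image = defaultdict(int)
    for image_id, task in entries:
        tasks_by_image[image_id].add(task)
        entries_by_image[image_id] += 1

    n_images = len(entries_by_image)
    if n_images == 0:
        raise ValueError("entries must contain at least one record")

    avg_tasks = sum(len(t) for t in tasks_by_image.values()) / n_images
    avg_entries = sum(entries_by_image.values()) / n_images
    return avg_tasks, avg_entries


# Toy, made-up data: an image-centric layout reuses the same image across
# several tasks, while a decentralized layout pairs each image with one task.
image_centric = [
    ("img1", "task_a"), ("img1", "task_b"), ("img1", "task_b"), ("img1", "task_c"),
    ("img2", "task_a"), ("img2", "task_c"),
]
decentralized = [
    ("imgA", "task_a"), ("imgB", "task_b"), ("imgC", "task_c"), ("imgD", "task_b"),
]

print(per_image_stats(image_centric))
print(per_image_stats(decentralized))
```

In this toy setting the decentralized layout yields exactly one task and one entry per image, whereas the image-centric layout yields several of each, which is the distinction the abstract draws between DMAX and IMAX.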

Paper Structure

This paper contains 39 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of the dataset construction. (a) Example of general decentralized multi-annotation data for multi-task learning. (b) Example of our image-centric multi-annotation data for multi-task learning.
  • Figure 2: Dataset statistics. (a) The overall composition of the image-centric multi-annotation X-ray dataset (IMAX). (b) The distribution of the number of tasks corresponding to each image. (c) The distribution of the number of training data entries corresponding to each image.
  • Figure 3: Illustration of the benefits that image-centric multi-annotation data bring to the multi-task learning ability of medical foundation models. We compare the performance of various medical MLLMs fine-tuned with IMAX and DMAX, respectively. CLS I and CLS II represent multi-class classification and multi-label classification, respectively.
  • Figure 4: Illustration of our proposed training strategy for DMAX data. Our pipeline comprises three stages: (1) primary training; (2) cross-task pseudo label generation; (3) hybrid pretraining and fine-tuning.
  • Figure 5: Violin plots and boxplots of the spectral entropy of the FIM eigenvalues at the initial, early, mid, and late phases of the training process.
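
Figure 5 reports the spectral entropy of the FIM eigenvalues, and the TL;DR also mentions a dominant eigenvalue ratio. This page does not give the paper's exact formulas, so the sketch below uses common definitions as an assumption: the eigenvalues are normalized into a distribution $p_i = \lambda_i / \sum_j \lambda_j$, the spectral entropy is $-\sum_i p_i \log p_i$, and the dominant ratio is $\lambda_{\max} / \sum_j \lambda_j$. The function names and the toy spectra are hypothetical.

```python
import math


def _clip_nonnegative(eigenvalues):
    # The FIM is positive semi-definite, so tiny negative values can only
    # come from numerical error; clip them to zero.
    return [max(float(lam), 0.0) for lam in eigenvalues]


def spectral_entropy(eigenvalues):
    """Shannon entropy (natural log) of the normalized eigenvalue spectrum."""
    lams = _clip_nonnegative(eigenvalues)
    total = sum(lams)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    # Zero eigenvalues contribute nothing to the entropy (0 * log 0 := 0).
    return -sum((lam / total) * math.log(lam / total) for lam in lams if lam > 0.0)


def dominant_eigenvalue_ratio(eigenvalues):
    """Share of the total spectrum carried by the largest eigenvalue."""
    lams = _clip_nonnegative(eigenvalues)
    total = sum(lams)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    return max(lams) / total


# Toy, made-up spectra: a flat spectrum versus one dominated by a single direction.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [97.0, 1.0, 1.0, 1.0]

print(spectral_entropy(flat), dominant_eigenvalue_ratio(flat))
print(spectral_entropy(peaked), dominant_eigenvalue_ratio(peaked))
```

Under these assumed definitions, a spectrum concentrated in one direction has lower entropy and a higher dominant ratio than a flat one, which matches the direction of the IMAX-versus-DMAX difference described in the TL;DR.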