Universal Medical Image Representation Learning with Compositional Decoders

Kaini Wang, Ling Yang, Siping Zhou, Guangquan Zhou, Wentao Zhang, Bin Cui, Shuo Li

TL;DR

UniMed is a decomposed-composed universal medical imaging paradigm that supports tasks at all levels. It achieves state-of-the-art performance on eight datasets across all three tasks and exhibits strong zero-shot and 100-shot transferability.

Abstract

Visual-language models have advanced the development of universal models, yet their application in medical imaging remains constrained by specific functional requirements and limited data. Current general-purpose models are typically designed with task-specific branches and heads, which restricts the shared feature space and the flexibility of the model. To address these challenges, we have developed a decomposed-composed universal medical imaging paradigm (UniMed) that supports tasks at all levels. To this end, we first propose a decomposed decoder that can predict two types of outputs, pixel and semantic, based on a defined input queue. Additionally, we introduce a composed decoder that unifies the input and output spaces and standardizes task annotations across different levels into a discrete token format. The coupled design of these two components enables the model to combine tasks flexibly so that they benefit one another. Moreover, our joint representation learning strategy leverages large amounts of unlabeled data and an unsupervised loss, achieving efficient one-stage pretraining for more robust performance. Experimental results show that UniMed achieves state-of-the-art performance on eight datasets across all three tasks and exhibits strong zero-shot and 100-shot transferability. We will release the code and trained models upon the paper's acceptance.

Paper Structure (16 sections, 10 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: a) The broad range of tasks in medical image analysis. b) The diversity of annotations both across tasks and between different datasets. c) Existing models require task-specific branches or heads. d) The proposed universal model seamlessly supports tasks at all levels by matching the decomposed output decoder with the composed label decoder.
  • Figure 2: Overview of UniMed, consisting of four core components: a visual encoder, a text encoder, a decomposed decoder, and composed decoders. The decomposed decoder maps the output space of all tasks into discrete tokens covering both semantic and pixel outputs. Similarly, the composed decoders bring annotations into the same format via a label converter to support cross-task learning.
  • Figure 3: UniMed can perform various medical image analysis tasks by dynamically combining input and output terminals, including: a) general classification/detection; b) general segmentation; c) referring classification/detection; d) referring segmentation.
  • Figure 4: Visualization results on detection and segmentation tasks compared with other methods.
  • Figure 5: Qualitative results demonstrating UniMed's ability to support referring tasks and to help obtain specified predictions in clinical use.
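
The abstract and the Figure 2 caption describe a label converter that standardizes annotations from different task levels into a discrete token format, but this summary does not give the token layout. As a rough illustration only, the sketch below assumes a coordinate-binning scheme of the kind used by sequence-style detectors: box coordinates are quantized into a fixed number of bins, and class labels are placed after the coordinate bins in the same vocabulary. The names `NUM_BINS`, `quantize`, `box_to_tokens`, `class_to_tokens`, and `tokens_to_box` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Assumption: 1000 coordinate bins; class tokens start right after them.
NUM_BINS = 1000


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)


def class_to_tokens(class_id: int, num_bins: int = NUM_BINS) -> List[int]:
    """An image-level label becomes a single class token."""
    return [num_bins + class_id]


def box_to_tokens(
    box: Tuple[float, float, float, float],
    width: float,
    height: float,
    class_id: int,
    num_bins: int = NUM_BINS,
) -> List[int]:
    """A detection label becomes four coordinate tokens plus one class token."""
    x0, y0, x1, y1 = box
    coords = [
        quantize(x0, width, num_bins),
        quantize(y0, height, num_bins),
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
    ]
    return coords + class_to_tokens(class_id, num_bins)


def tokens_to_box(
    tokens: List[int], width: float, height: float, num_bins: int = NUM_BINS
) -> Tuple[Tuple[float, float, float, float], int]:
    """Invert box_to_tokens, returning bin centres and the class id."""
    bx0, by0, bx1, by1, cls = tokens

    def centre(bin_index: int, size: float) -> float:
        return (bin_index + 0.5) / num_bins * size

    box = (centre(bx0, width), centre(by0, height),
           centre(bx1, width), centre(by1, height))
    return box, cls - num_bins


tokens = box_to_tokens((0.0, 0.0, 512.0, 256.0), 512.0, 512.0, class_id=3)
box, cls = tokens_to_box(tokens, 512.0, 512.0)
```

Under this assumed layout, a classification label and a detection label share one vocabulary, which is one way a single decoder could be supervised across task levels. The round trip is lossy by at most one bin width per coordinate.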