TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation

Ranmin Wang, Limin Zhuang, Hongkun Chen, Boyan Xu, Ruichu Cai

TL;DR

TP-UNet uses temporal prompts, which encode organ-construction relationships, to guide a UNet segmentation model. It combines the temporal prompts with image features through cross-attention and semantic alignment based on unsupervised contrastive learning.

Abstract

Medical image segmentation has advanced through the adoption of deep learning, particularly UNet-based approaches, which exploit semantic information to improve segmentation accuracy. However, current UNet-based approaches disregard the order in which organs appear in scanned images, and the UNet architecture itself provides no direct mechanism for integrating temporal information. To integrate temporal information efficiently, we propose TP-UNet, which uses temporal prompts encoding organ-construction relationships to guide the UNet segmentation model. Specifically, our framework uses cross-attention and semantic alignment based on unsupervised contrastive learning to combine temporal prompts and image features effectively. Extensive evaluations on two medical image segmentation datasets demonstrate the state-of-the-art performance of TP-UNet. Our implementation will be open-sourced after acceptance.
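The abstract mentions semantic alignment based on unsupervised contrastive learning but does not give the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style loss, a common choice for aligning two modalities; it is an assumption, not the formulation used in the paper. The function names and the temperature value of 0.07 are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(image_feats, prompt_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed, not taken from the paper).

    image_feats[i] and prompt_feats[i] are treated as a positive pair;
    every other pairing in the batch is treated as a negative.
    """
    if not image_feats or len(image_feats) != len(prompt_feats):
        raise ValueError("need two non-empty batches of equal size")
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in prompt_feats]
    n = len(img)
    # Cosine similarity between every image and every prompt, scaled by temperature.
    logits = [[dot(img[i], txt[j]) / temperature for j in range(n)] for i in range(n)]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    # Image-to-prompt and prompt-to-image cross-entropy against the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: matched pairs give a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_alignment_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each image feature toward its own prompt feature and away from the other prompts in the batch, which is one way to narrow the gap between two modality encoders before their features are fused.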

Paper Structure

This paper contains 18 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The temporal information of the liver. The kernel density plot of liver occurrence probability approximately follows a normal distribution $\mathcal{N}(\mu_{Liver}, \sigma_{Liver})$ over timestamps ranging from $\frac{1}{N}$ to $\frac{N}{N}$. Liver occurrence peaks at a timestamp of approximately 0.78. For multiple organs, such as the three organs in the UW-Madison dataset (i.e., stomach, large intestine, and small intestine), the probabilities of occurrence also vary across timestamps, typically with $\mu_{stomach} \leq \mu_{small} \leq \mu_{large}$. This temporal information is crucial for guiding the model in segmentation tasks.
  • Figure 2: The general framework of TP-UNet. For a given medical image $I$ to be segmented, TP-UNet first automatically generates its corresponding temporal prompt $P_t$. The UNet encoder then extracts features from $I$. These features are fused with the encoded temporal prompt $F_t$. Before fusion, a semantic alignment operation bridges the gap between the two modality encoders. Finally, the UNet decoder decodes the fused features to produce the final masks. The text encoder in this study uses two architectures, CLIP and Electra, which are trained using the LoRA and SFT methods, respectively.
  • Figure 3: Case study. We conducted four case studies on the LITS dataset. In the qualitative results, our method achieved excellent performance.
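The timestamps and occurrence distribution described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the mean of 0.78 comes from the caption, while the standard deviation of 0.1, the function names, and the prompt wording are hypothetical and not taken from the paper.

```python
from statistics import NormalDist


def slice_timestamps(num_slices: int) -> list[float]:
    """Normalized timestamps 1/N, 2/N, ..., N/N for a scan with N slices."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    return [i / num_slices for i in range(1, num_slices + 1)]


def occurrence_density(timestamp: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma) at a timestamp, a stand-in for organ occurrence likelihood."""
    return NormalDist(mu, sigma).pdf(timestamp)


def build_temporal_prompt(index: int, num_slices: int, organ: str) -> str:
    """Hypothetical prompt template; the paper's actual prompt text is not shown here."""
    timestamp = index / num_slices
    return (
        f"Slice {index} of {num_slices} (timestamp {timestamp:.2f}); "
        f"segment the {organ}."
    )


timestamps = slice_timestamps(4)
# mu = 0.78 is the liver peak from the caption; sigma = 0.1 is an assumed value.
near_peak = occurrence_density(0.78, mu=0.78, sigma=0.1)
far_from_peak = occurrence_density(0.30, mu=0.78, sigma=0.1)
prompt = build_temporal_prompt(2, 4, "liver")
```

Under these assumptions, a slice near timestamp 0.78 gets a much higher liver density than a slice near 0.30, which is the kind of positional cue a temporal prompt could pass to the segmentation model.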