Multi-scale Feature Enhancement in Multi-task Learning for Medical Image Analysis

Phuoc-Nguyen Bui, Duc-Tai Le, Junghyun Bum, Hyunseung Choo

TL;DR

This work addresses joint segmentation and classification in medical imaging with a UNet-based multi-task learning (MTL) framework. Its two main components are the ResFormer encoder, which fuses CNN-based local feature extraction with Transformer-based global context, and a dilated feature enhancement (DFE) module, which refines the skip-connection features passed to the decoder by aggregating information at multiple scales. For image-level classification, the model combines multi-scale encoder features in a simple head and trains both tasks with a weighted loss that balances them. On RETOUCH (OCT) and ISIC 2017 (skin lesions), the authors report state-of-the-art results for both segmentation (DSC/JI) and classification (accuracy, sensitivity, specificity), which they attribute to integrating local and global representations within a unified MTL framework and take as evidence of cross-domain generalization. The paper positions the method as diagnostic support and points to semi-supervised strategies as future work to reduce the annotation burden.
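The weighted loss mentioned above can be sketched as a weighted sum of a segmentation term and a classification term. This is a minimal illustration, not the authors' formulation: the choice of soft Dice for segmentation, cross-entropy for classification, the function names, and the default weights `w_seg` and `w_cls` are all assumptions made for the example.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on flat lists of probabilities and binary labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the true class for one image."""
    return -math.log(max(probs[label], eps))


def multitask_loss(seg_pred, seg_target, cls_probs, cls_label,
                   w_seg=0.5, w_cls=0.5):
    """Weighted sum of the two task losses (weights are hypothetical)."""
    seg = dice_loss(seg_pred, seg_target)
    cls = cross_entropy(cls_probs, cls_label)
    return w_seg * seg + w_cls * cls
```

Raising `w_seg` relative to `w_cls` shifts the shared encoder toward features that help the mask prediction, and vice versa; the weights actually used in the paper are not stated in this summary.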

Abstract

Traditional deep learning methods in medical imaging often focus solely on segmentation or classification, limiting their ability to leverage shared information. Multi-task learning (MTL) addresses this by combining both tasks through shared representations but often struggles to balance local spatial features for segmentation and global semantic features for classification, leading to suboptimal performance. In this paper, we propose a simple yet effective UNet-based MTL model, where features extracted by the encoder are used to predict classification labels, while the decoder produces the segmentation mask. The model introduces an advanced encoder incorporating a novel ResFormer block that integrates local context from convolutional feature extraction with long-range dependencies modeled by the Transformer. This design captures broader contextual relationships and fine-grained details, improving classification and segmentation accuracy. To enhance classification performance, multi-scale features from different encoder levels are combined to leverage the hierarchical representation of the input image. For segmentation, the features passed to the decoder via skip connections are refined using a novel dilated feature enhancement (DFE) module, which captures information at different scales through three parallel convolution branches with varying dilation rates. This allows the decoder to detect lesions of varying sizes with greater accuracy. Experimental results across multiple medical datasets confirm the superior performance of our model in both segmentation and classification tasks, compared to state-of-the-art single-task and multi-task learning methods.
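To make the idea behind the DFE module concrete, the sketch below runs three parallel dilated convolutions over the same input and fuses them. It is a deliberately simplified 1D, pure-Python toy under stated assumptions: the dilation rates (1, 2, 3), the averaging kernel, and fusion by element-wise mean are hypothetical choices for illustration, and the paper's module operates on 2D feature maps with learned weights.

```python
def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, same-length 1D convolution with a dilated kernel."""
    center = (len(kernel) - 1) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def dfe_sketch(x, kernel=(1 / 3, 1 / 3, 1 / 3), dilations=(1, 2, 3)):
    """Three parallel dilated branches fused by element-wise mean (assumed)."""
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]
```

With a 3-tap kernel, dilation rates 1, 2, and 3 cover spans of 3, 5, and 7 positions, which is why parallel branches with different rates can respond to structures of different sizes without adding parameters per branch.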

Paper Structure

This paper contains 14 sections, 16 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Evolution of multi-task learning methods in medical image classification and segmentation. Conv: Convolutional, Trans: Transformer.
  • Figure 2: Overview of the proposed multi-task learning method for medical image classification and segmentation. The method uses the U-Net architecture with a shared encoder and two dedicated decoders, one for classification and one for segmentation.
  • Figure 3: Two designs of the proposed ResFormer block, which combines ResNet (He et al., 2016) and Swin Transformer (Liu et al., 2021) blocks.
  • Figure 4: The proposed dilated feature enhancement (DFE) module.
  • Figure 5: Segmentation visualization of MTL methods on the RETOUCH dataset. Red, green, and blue indicate intra-retinal fluid, sub-retinal fluid, and pigment epithelial detachment, respectively.
  • ...and 3 more figures