UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based on U-Net with Reduced Skip-Connections

Lingxiao Yin, Wei Tao, Dongyue Zhao, Tadayuki Ito, Kinya Osa, Masami Kato, Tse-Wei Chen

TL;DR

The paper tackles the high memory cost of skip-connections in U-Net by introducing two plug-in modules, MSIAM and IEM, that convert multi-scale encoder features into a single-scale representation and then regenerate enhanced multi-scale features for decoding. This design yields UNet--, a memory-efficient, feature-enhanced variant that preserves accuracy while reducing skip-connection memory by up to ~94%. The authors validate the approach by integrating UNet-- with the strong image restoration model NAFNet and evaluating on denoising, deblurring, super-resolution, and matting benchmarks, showing consistent memory savings and performance gains. The method is modular and task-agnostic, enabling easy adoption across U-Net variants and a range of vision tasks, with practical impact for deployment on resource-limited devices.

Abstract

U-Net models, which consist of an encoder, a decoder, and skip-connections, have proven effective in a variety of vision tasks. The skip-connections transmit fine-grained information from the encoder to the decoder, so the feature maps they carry must be kept in memory until the decoding stage. This makes U-Net models poorly suited to devices with limited resources. In this paper, we propose a universal method and architecture that reduces memory consumption while generating enhanced feature maps to improve network performance. To this end, we design a simple but effective Multi-Scale Information Aggregation Module (MSIAM) in the encoder and an Information Enhancement Module (IEM) in the decoder. The MSIAM aggregates multi-scale feature maps into a single-scale feature map that requires less memory. The IEM then expands and enhances the aggregated feature map back into multi-scale feature maps. By applying the proposed method to NAFNet, a state-of-the-art model for image restoration, we design a memory-efficient and feature-enhanced network architecture, UNet--. The memory demand of the skip-connections in UNet-- is reduced by 93.3%, while performance improves compared with NAFNet. Furthermore, we show that the proposed method generalizes to multiple visual tasks, with consistent improvements in both memory consumption and network accuracy over existing efficient architectures.
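
The 93.3% figure is consistent with a simple back-of-the-envelope count. The sketch below is only an illustration under assumptions the abstract does not state: a four-level encoder in which each level halves the height and width and doubles the channels, and a single aggregated feature map the size of the deepest encoder feature. The helper names `skip_memory` and `aggregated_memory` are hypothetical.

```python
from fractions import Fraction


def skip_memory(levels: int) -> Fraction:
    """Total skip-connection memory of a plain U-Net, in units of M_E1.

    Assumption: level i has half the height, half the width, and twice the
    channels of level i-1, so it needs 1/2**i of the memory of level E1.
    """
    return sum(Fraction(1, 2 ** i) for i in range(levels))


def aggregated_memory(levels: int) -> Fraction:
    """Memory of one aggregated map, assumed to match the deepest encoder feature."""
    return Fraction(1, 2 ** (levels - 1))


levels = 4                            # assumed number of encoder levels
baseline = skip_memory(levels)        # 1 + 1/2 + 1/4 + 1/8 = 15/8
reduced = aggregated_memory(levels)   # 1/8
reduction = 1 - reduced / baseline    # 14/15

print(f"baseline={baseline}, reduced={reduced}, reduction={float(reduction):.1%}")
```

Under these assumptions the skip-connections drop from 15/8 to 1/8 of $M_{E_1}$, a reduction of 14/15, which rounds to the reported 93.3%. A different number of levels or a different channel schedule would give a different ratio.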

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (a) Architecture of a common U-Net model. The dashed lines represent the skip-connections. Feature maps in the blue shadow must be kept in memory. (b) Memory consumption of the skip-connections in U-Net and UNet$--$ throughout inference, in units of $M_{E_1}$, the memory consumption of $E_1$.
  • Figure 2: Illustration of the proposed UNet$--$ network. The original skip-connections are replaced with MSIAM and IEM. MSIAM aggregates multi-scale feature maps into a single-scale feature map with lower memory demand. IEM generates enhanced multi-scale feature maps from the output of MSIAM. Feature maps in the blue shadow must be kept in memory.
  • Figure 3: Model structure variants corresponding to different target resolutions. Feature maps in the blue shadow must be kept in memory. (a) The resolution of the aggregated feature map equals the minimum resolution in the encoder. (b) The resolution of the aggregated feature map equals an intermediate resolution in the encoder. (c) The resolution of the aggregated feature map equals the maximum resolution in the encoder.
  • Figure 4: Visualization results. The first row shows noisy images, the second row shows the clean images output by NAFNet [chen2022simple] with UNet$--$, and the third row compares local details.
  • Figure 5: Visualization results. The first row shows blurry images, the second row shows the clean images output by NAFNet [chen2022simple] with UNet$--$, and the third row compares local details.
  • ...and 2 more figures
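
The three variants in Figure 3 differ only in the resolution chosen for the aggregated feature map. The sketch below shows how that choice determines which encoder levels are downsampled or upsampled, and how much memory is retained. The encoder shapes, the function names `resize_plan` and `kept_fraction`, and the assumption that the aggregated map has the same channel count as the encoder level at the target resolution are all illustrative and not taken from the paper.

```python
from fractions import Fraction

# Hypothetical encoder feature shapes (channels, height, width) for a
# four-level U-Net on a 256x256 input. Illustrative values only.
ENCODER_SHAPES = [(32, 256, 256), (64, 128, 128), (128, 64, 64), (256, 32, 32)]


def num_elements(shape):
    """Number of values in a (channels, height, width) feature map."""
    channels, height, width = shape
    return channels * height * width


def resize_plan(shapes, target):
    """List how each encoder level is resized to the target level's resolution."""
    target_height = shapes[target][1]
    plan = []
    for level, (_, height, _) in enumerate(shapes):
        if height > target_height:
            plan.append((level, "downsample", height // target_height))
        elif height < target_height:
            plan.append((level, "upsample", target_height // height))
        else:
            plan.append((level, "keep", 1))
    return plan


def kept_fraction(shapes, target):
    """Memory of one map shaped like level `target`, relative to all skip maps.

    Assumes the aggregated map has the same channel count as that level.
    """
    total = sum(num_elements(s) for s in shapes)
    return Fraction(num_elements(shapes[target]), total)


for name, target in [("(a) minimum", 3), ("(b) intermediate", 1), ("(c) maximum", 0)]:
    print(name, resize_plan(ENCODER_SHAPES, target), kept_fraction(ENCODER_SHAPES, target))
```

With these illustrative shapes, aggregating at the minimum resolution keeps 1/15 of the original skip-connection memory, an intermediate resolution keeps 4/15, and the maximum resolution keeps 8/15, so a lower target resolution saves more memory at the cost of more aggressive downsampling of the shallow features.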