WiTUnet: A U-Shaped Architecture Integrating CNN and Transformer for Improved Feature Alignment and Local Information Fusion

Bin Wang; Fei Deng; Peifan Jiang; Shuang Wang; Xiao Han; Zhixuan Zhang

TL;DR

WiTUnet tackles low-dose CT (LDCT) denoising by rethinking the U-shaped architecture and integrating CNNs with Transformers. It introduces nested dense skip blocks for better encoder–decoder feature alignment, a non-overlapping windowed Transformer to reduce compute, and a Local Image Perception Enhancement (LiPe) module to improve local detail capture. On the NIH-AAPM-Mayo LDCT dataset, the approach outperforms strong baselines in PSNR, SSIM, and RMSE, and ablations confirm the contribution of both LiPe and the nested dense connections. The architecture balances global and local information processing, improving denoising quality at a computational cost practical for clinical imaging workflows.
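For readers unfamiliar with the metrics named above, the following is a minimal sketch of RMSE and PSNR using their standard textbook definitions. The function names and the `data_range` parameter are illustrative assumptions, not taken from the paper, and SSIM is omitted because it requires local window statistics that do not fit a short example.

```python
import math

def rmse(pred, target):
    """Root mean square error between two equally sized 2D images (lists of rows)."""
    total = 0.0
    count = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
            count += 1
    return math.sqrt(total / count)

def psnr(pred, target, data_range):
    """Peak signal-to-noise ratio in dB.

    data_range is the assumed span (max - min) of valid pixel values;
    the value used in the paper's evaluation is not stated here.
    """
    err = rmse(pred, target)
    if err == 0:
        return math.inf  # identical images
    return 20.0 * math.log10(data_range / err)

# Toy example: every pixel is off by exactly 1.
reference = [[0.0, 0.0], [0.0, 0.0]]
denoised = [[1.0, 1.0], [1.0, 1.0]]
toy_rmse = rmse(denoised, reference)
toy_psnr = psnr(denoised, reference, data_range=10.0)
```

Lower RMSE and higher PSNR both indicate a denoised image closer to the full-dose reference.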

Abstract

Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging because it exposes patients to less radiation than standard CT, although the reduced dose increases image noise and can compromise diagnostic accuracy. To address this, advanced deep learning-based LDCT denoising algorithms have been developed, primarily using Convolutional Neural Networks (CNNs) or Transformer networks within the Unet architecture. This architecture enhances image detail by integrating feature maps from the encoder and decoder via skip connections. However, current methods often overlook enhancements to the Unet architecture itself, focusing instead on optimizing the encoder and decoder structures. This is problematic because encoder and decoder feature maps differ significantly in their characteristics, so simple fusion strategies may not reconstruct images effectively. In this paper, we introduce WiTUnet, a novel LDCT image denoising method that replaces traditional skip connections with nested, dense skip pathways to improve feature integration. WiTUnet also incorporates a windowed Transformer structure that processes images in smaller, non-overlapping segments, reducing computational load. Additionally, a Local Image Perception Enhancement (LiPe) module replaces the standard multi-layer perceptron (MLP) of the Transformer in both the encoder and decoder, enhancing local feature capture and representation. In extensive experimental comparisons, WiTUnet outperforms existing methods on key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE), significantly improving noise removal and image quality.
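The idea of processing an image in non-overlapping windows can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the single-channel list-of-rows image layout, and the requirement that the image size be divisible by the window size are all assumptions. The last helper counts query-key pairs to show why restricting attention to windows reduces the computational load.

```python
def window_partition(image, win):
    """Split an H x W image (list of rows) into non-overlapping win x win windows.

    Windows are returned in row-major order. Assumes H and W are divisible by win.
    """
    h, w = len(image), len(image[0])
    if h % win or w % win:
        raise ValueError("image size must be divisible by the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([row[left:left + win] for row in image[top:top + win]])
    return windows

def window_reverse(windows, win, h, w):
    """Reassemble the H x W image from windows produced by window_partition."""
    image = [[None] * w for _ in range(h)]
    per_row = w // win
    for idx, window in enumerate(windows):
        top, left = (idx // per_row) * win, (idx % per_row) * win
        for r in range(win):
            image[top + r][left:left + win] = window[r]
    return image

def attention_pairs(h, w, win=None):
    """Count query-key pairs: global attention if win is None, else per-window."""
    if win is None:
        return (h * w) ** 2
    num_windows = (h // win) * (w // win)
    return num_windows * (win * win) ** 2

# Toy 4 x 4 image with pixel values 0..15, split into 2 x 2 windows.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(image, 2)
restored = window_reverse(windows, 2, 4, 4)
```

For this toy image, global attention compares 256 token pairs while windowed attention compares 64, and the gap widens as the image grows because the windowed cost scales linearly with the number of pixels for a fixed window size.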

Paper Structure (19 sections, 12 equations, 6 figures, 2 tables)

Introduction
Related work
CNN
Transformer
Method
Overall architecture
Window Transformer block
Nested dense block
Experiments and results
Experiments Detail
Datasets
Train details
Results
Evaluation measures
Comparing method
...and 4 more sections

Figures (6)

Figure 1: (a) Illustrates the U-shaped network architecture of WiTUnet, featuring encoder and decoder connected by nested dense blocks. (b) Describes the structure of the encoder, bottleneck and decoder, consisting of $N$ WT blocks, and for the decoder, an additional channel projection is depicted. (c) Details the WT block, highlighting the layers and their functions, including Layer Normalization and Local Image Perception Enhancement (LiPe).
Figure 2: The diagram visualizes the Local Image Perception Enhancement (LiPe) module, showing the sequence of operations, including the convolutional layers and the transformation from image to tokens and back, that refines features within the network architecture.
Figure 3: (a) Demonstrates the nested dense network structure of WiTUnet, highlighting the complex skip connection approach. In the figure, $X_{k,0}$ where $k\in [0,D)$ represents the encoder at different layers, $X_{k,v}$ where $k\in [0,D)$ and $v=D-k$ represents the decoder, and $X_{D,0}$ is the bottleneck layer. The green and blue arrows show the specially designed skip paths. The redesigned paths change the connection between the encoder and decoder: the encoder's feature map must pass through multiple dense convolutional blocks before it reaches the decoder. (b) Describes the computation at each node, clarifying how information flows and is processed throughout the network.
Figure 4: Image pair of NIH-AAPM Mayo dataset (partial).
Figure 5: (a) LDCT. (b) FDCT. (c) DnCNN. (d) REDCNN. (e) ADNet. (f) NBNet. (g) CTformer. (h) WiTUnet. This figure also includes magnified views of the region of interest (ROI) for each denoising method. To visualize the differences between the denoising results and the FDCT images, differential visualizations are provided, depicting the noise reduction outcome relative to the FDCT. The display window is set to [-160, 240] Hounsfield Units (HU) to enhance the visualization. Additionally, to optimize the visual presentation, the brightness of all visualization results has been adjusted.
...and 1 more figures
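The node indexing in the Figure 3 caption can be made concrete with a small sketch. The roles of $X_{k,0}$ (encoder), $X_{k,D-k}$ (decoder), and $X_{D,0}$ (bottleneck) follow the caption. The wiring in `node_inputs` is an assumption: it follows the UNet++-style pattern in which each node receives all earlier nodes on its level plus the upsampled node from the level below, which the caption suggests but does not spell out. Function names are hypothetical.

```python
def nested_dense_nodes(depth):
    """Map each node index (k, v) to its role in a nested dense U-shape of depth D."""
    roles = {}
    for k in range(depth + 1):
        for v in range(depth - k + 1):
            if k == depth:
                roles[(k, v)] = "bottleneck"   # X_{D,0}
            elif v == 0:
                roles[(k, v)] = "encoder"      # X_{k,0}, k in [0, D)
            elif v == depth - k:
                roles[(k, v)] = "decoder"      # X_{k,v} with v = D - k
            else:
                roles[(k, v)] = "dense"        # intermediate dense skip block
    return roles

def node_inputs(k, v):
    """Assumed UNet++-style inputs of node X_{k,v} (not confirmed by the caption)."""
    if v == 0:
        # Encoder and bottleneck nodes take the downsampled previous level.
        return [] if k == 0 else [(k - 1, 0)]
    same_level = [(k, j) for j in range(v)]  # dense connections along the level
    from_below = [(k + 1, v - 1)]            # upsampled feature from the deeper level
    return same_level + from_below

roles = nested_dense_nodes(3)
top_decoder_inputs = node_inputs(0, 3)
```

With $D=3$ this yields ten nodes, and under the assumed wiring the top-level decoder $X_{0,3}$ sees the encoder feature $X_{0,0}$ only alongside the dense blocks $X_{0,1}$ and $X_{0,2}$, which is consistent with the caption's remark that encoder features pass through dense convolutional blocks before reaching the decoder.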
