Rethinking Information Loss in Medical Image Segmentation with Various-sized Targets

Tianyi Liu; Zhaorui Tan; Kaizhu Huang; Haochuan Jiang

TL;DR

This paper introduces a novel Stagger Network (SNet) and argues that a well-designed fusion structure can mitigate the divergence in latent feature distributions between CNNs and ViTs, thereby reducing information loss.

Abstract

Medical image segmentation presents the challenge of segmenting targets of various sizes, which requires the model to capture both local and global information. Despite recent efforts using CNNs and ViTs to predict annotations at different scales, these approaches often struggle to balance the detection of targets across varying sizes. Simply combining local information from CNNs with global relationships from ViTs, without considering the potentially significant divergence in their latent feature distributions, may result in substantial information loss. To address this issue, we introduce a novel Stagger Network (SNet) and argue that a well-designed fusion structure can mitigate the divergence in latent feature distributions between CNNs and ViTs, thereby reducing information loss. Specifically, to emphasize both global dependencies and local focus, we design a Parallel Module to bridge the semantic gap. We also propose the Stagger Module, which fuses selected features that are more semantically similar. An Information Recovery Module is further adopted to recover complementary information back into the network. As a key contribution, we show theoretically that the proposed parallel and stagger strategies lead to less information loss, which supports the rationale of SNet. Experimental results show that SNet outperforms recent state-of-the-art methods on the Synapse dataset, where targets vary in size. It also demonstrates superiority on the ACDC and MoNuSeg datasets, where target sizes are more consistent.

Paper Structure (26 sections, 2 theorems, 10 equations, 7 figures, 5 tables)

Introduction
Related Work
CNNs and ViTs
Feature Fusion Methods
Simple Replacement Methods
Advanced Fusion Methods
Theoretical Motivation of Stagger Fusion
Methodology
Overview
Parallel Module
Stagger Module
Information Recovery Module
Experiments
Experimental Setup
Datasets
...and 11 more sections

Key Result

Theorem 2

Han's inequality is presented below: Let $X^i$ be a discrete $i$-dimensional random variable and denote $\bar{H}^k\left(X^i\right)=\frac{1}{\binom{i}{k}} \sum_{T \subset \binom{[i]}{k}} H(X_{T})$ as the average entropy of $k$ randomly selected dimensions $(k\leq i)$. Then $\frac{
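The statement above is cut off on this page, so the following is only a minimal sketch of the quantity it defines, not the paper's own formulation. It computes the average entropy $\bar{H}^k$ over all $k$-subsets of dimensions for a small discrete joint distribution, and then checks the standard textbook form of Han's inequality, namely that $\bar{H}^k / k$ is non-increasing in $k$. The function names, the dictionary representation of the joint distribution, and the XOR toy distribution are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import log2


def subset_entropy(joint, dims):
    """Shannon entropy (in bits) of the marginal of `joint` on coordinates `dims`.

    `joint` maps outcome tuples to probabilities (hypothetical representation).
    """
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[d] for d in dims)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def average_entropy(joint, k):
    """Average of H(X_T) over all k-element subsets T of the dimensions."""
    n = len(next(iter(joint)))  # number of dimensions, read from any outcome
    subsets = list(combinations(range(n), k))
    return sum(subset_entropy(joint, T) for T in subsets) / len(subsets)


# Toy joint distribution (assumption): X1, X2 are independent fair bits
# and X3 = X1 XOR X2, so any two coordinates determine the third.
joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

# Normalized average entropies for k = 1, 2, 3.
normalized = [average_entropy(joint, k) / k for k in (1, 2, 3)]

# Han's inequality (standard form): the sequence is non-increasing in k.
is_non_increasing = all(
    normalized[j] >= normalized[j + 1] - 1e-12 for j in range(len(normalized) - 1)
)
```

In this toy case the per-dimension entropy stays flat from one to two dimensions and drops once all three are taken together, because the third coordinate adds no new information.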

Figures (7)

Figure 1: (Top) Visualization of feature heatmaps and histogram distributions of lower CNNs, higher CNNs, lower ViTs, and higher ViTs. Higher layers have a darker color than lower layers. (Bottom) Unstagger fusion fuses lower layers of CNNs with those from lower ViTs, as well as features from higher layers of CNNs with those from higher ViTs. Stagger fusion fuses features from higher layers of CNNs and those from lower ViTs. Different colors represent dissimilar distributions of these feature layers.
Figure 2: Unstagger fusion (a) and stagger fusion (b). Heatmaps of the input layers are shown in the first two columns, with the name of each layer at the bottom and its corresponding density map above the heatmap. The heatmaps and density maps of the fusion results are shown in the third column. The segmentation results of each fusion method are shown in the fourth column, and the input image and its ground-truth label are shown in the fifth column.
Figure 3: The overall framework of SNet. The label C in a circle means concatenate.
Figure 4: Feature Enhancement Module (FEB): As seen in Figure \ref{are}, the raw image is the input of the 1st ViT layer. After down-sampling, the output of the 1st ViT layer becomes the input of the 2nd ViT layer. The outputs of the 1st and 2nd ViT layers, $F_i$ and $F_{i+1}$, are then fused in the FEB and split back into two features $F^{'}_{i}$ and $F^{'}_{i+1}$.
Figure 5: Feature Fusion Block (FFB): $\bigoplus$ means concatenation.
...and 2 more figures
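
The captions for Figures 1 and 2 contrast two ways of pairing CNN and ViT layers and compare their feature distributions via histograms. The sketch below illustrates that idea under stated assumptions and is not the SNet implementation: it assumes levels are indexed from 0 (lowest), that stagger fusion pairs CNN level i+1 with ViT level i, and that distribution mismatch is scored with the Jensen-Shannon divergence between histograms. All function names, the one-level offset, and the choice of divergence are hypothetical.

```python
from math import log2


def histogram(values, bins=8, lo=0.0, hi=1.0):
    """Normalized histogram of a non-empty list of feature activations."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two histograms."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def unstagger_pairs(num_levels):
    """(cnn_level, vit_level) pairs that fuse same-depth layers."""
    return [(i, i) for i in range(num_levels)]


def stagger_pairs(num_levels):
    """(cnn_level, vit_level) pairs that fuse a higher CNN layer with a lower ViT layer."""
    return [(i + 1, i) for i in range(num_levels - 1)]


def pairing_mismatch(cnn_feats, vit_feats, pairs):
    """Mean histogram divergence over the given layer pairs.

    `cnn_feats` and `vit_feats` are lists of per-level activation lists,
    supplied by the reader (no values are assumed here).
    """
    scores = [
        js_divergence(histogram(cnn_feats[c]), histogram(vit_feats[v]))
        for c, v in pairs
    ]
    return sum(scores) / len(scores)
```

Given per-level activations from a trained model, `pairing_mismatch` could be evaluated once with `unstagger_pairs` and once with `stagger_pairs` to compare how closely the paired distributions match under each scheme.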

Theorems & Definitions (3)

Theorem 2: Han's inequality boucheronconcentration
Proposition 3
proof
