Merging Context Clustering with Visual State Space Models for Medical Image Segmentation

Yun Zhu; Dong Zhang; Yi Lin; Yifei Feng; Jinhui Tang

TL;DR

This work tackles the challenge of medical image segmentation by enabling simultaneous modeling of long-range and local spatial context. It introduces CCViM, a Vision Mamba-based architecture that integrates a context clustering (CC) layer into a CCS6 module to adaptively form local windows while preserving global receptive fields. Extensive experiments on Kumar, CPM17, ISIC2017/2018, and Synapse demonstrate superior performance over state-of-the-art methods across nuclei, skin lesion, and multi-organ segmentation, with ablations confirming the CC layer’s effectiveness. The approach offers a computationally efficient means to fuse local and global information, with potential for adaptive scanning strategies and broader medical imaging tasks.

Abstract

Medical image segmentation requires aggregating global and local feature representations, which challenges current methods to handle both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions because they model long-range feature interactions with linear complexity. However, existing ViM approaches directly flatten spatial tokens, which overlooks short-range local dependencies, and they are constrained by fixed scanning patterns that limit the capture of dynamic spatial context. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module into existing ViM models to partition image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, namely Kumar, CPM17, ISIC17, ISIC18, and Synapse, demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at https://github.com/zymissy/CCViM.
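
The local clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`split_windows`, `cluster_window`), the use of cosine similarity for assignment, and the mean aggregation are all assumptions chosen for illustration. The text above only states that tokens are partitioned into windows and clustered adaptively within them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def split_windows(grid, win):
    """Split an H x W grid of feature vectors into non-overlapping win x win windows.

    Each window is returned as a flat list of points (row-major inside the window).
    """
    h, w = len(grid), len(grid[0])
    windows = []
    for r0 in range(0, h, win):
        for c0 in range(0, w, win):
            windows.append([
                grid[r][c]
                for r in range(r0, min(r0 + win, h))
                for c in range(c0, min(c0 + win, w))
            ])
    return windows

def cluster_window(points, centers):
    """Assign each point to its most similar center (assumed: cosine similarity).

    Returns (assignments, cluster_means). An empty cluster keeps its center.
    """
    assign = [
        max(range(len(centers)), key=lambda k: cosine(p, centers[k]))
        for p in points
    ]
    dim = len(points[0])
    means = []
    for k in range(len(centers)):
        members = [p for p, a in zip(points, assign) if a == k]
        if members:
            means.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
        else:
            means.append(list(centers[k]))
    return assign, means
```

A caller would first apply `split_windows` to a feature grid and then run `cluster_window` on each window with its own centers. How the centers are proposed (for example by pooling) is left open here because the text above does not specify it.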

Paper Structure (20 sections, 6 equations, 8 figures, 6 tables)

This paper contains 20 sections, 6 equations, 8 figures, 6 tables.

Introduction
Related Work
Medical Image Segmentation (MedISeg)
Feature Interaction for MedISeg
Vision Mamba (ViM)
Our Method
Preliminary and Notation
Overall Architecture
Context Clustering Selective State Space Model
Context Clustering
Post Processing
Loss Function
Experiments
Datasets and Evaluation Metrics
Configuration of Scan Directions and Local Clusters
...and 5 more sections

Figures (8)

Figure 1: Illustration of the feature interaction mechanism in MedISeg. CNNs in (a) conceptualize an image as structured feature grids, employing convolutional layers that slide over the local space with a certain stride. ViT in (b) uses self-attention, treating an image as tokens and enabling each token to interact with other tokens. VMamba in (c) uses the cross-scan module to integrate the pixels from different directions, achieving a global receptive field with linear complexity. Our CCViM in (d) combines the cross-scan module with our context clustering layer. Our method treats an image as a set of data points, dynamically grouping all points into clusters within a local window to extract local contexts.
Figure 2: (a) The overall architecture of context clustering vision mamba (CCViM), a U-shaped structure for MedISeg. (b) CCViM block, the core component of CCViM. (c) CCS6 layer, the core module of the CCViM block. The CCS6 layer offers six different methods to process the input (patched feature maps). Four of these methods use a cross-scan module to flatten the patched feature maps and scan the flattened features in four different directions. The other two methods apply CC layers with $4$ and $25$ cluster centers. Four of the six methods are then selected to process the input feature map, and the processed information is fed into the S6 module. Finally, the output features are merged to construct the final feature map. (d) CC layer, which views each patch as a set of feature grid points and clusters these feature grid points into several centers.
Figure 3: The configuration of various scan directions and CC layers with different cluster centers. The configuration of each stage is different. We directly use the configuration of LocalMamba, replacing LocalMamba's local scan with our CC layer.
Figure 4: Visualizations on the Kumar and CPM17 datasets. Different colours of the nuclear boundaries denote separate instances.
Figure 5: Visualizations on the ISIC17 dataset. The left side presents the ground truth alongside the predicted masks from our model and VM-UNet. Our predicted masks are closer to the ground truth. The right side displays the original skin images annotated with lesion contours; green contours denote the ground truth, while red contours indicate the predicted segmentation results. These comparisons further demonstrate the effectiveness of our CCViM in accurately segmenting skin lesions.
...and 3 more figures
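
The cross-scan step described in the Figure 2 caption (flatten the patched feature map, then scan it in four directions and merge the outputs) can be sketched as follows. This is an illustrative sketch only: the specific four orders used here (row-major, column-major, and their reverses) follow the common VMamba-style convention and are an assumption, as are the function names `cross_scan` and `cross_merge`. The S6 processing between scan and merge is omitted.

```python
def cross_scan(grid):
    """Flatten an H x W grid into four 1-D sequences (assumed scan orders).

    Order: row-major, column-major, reversed row-major, reversed column-major.
    """
    h, w = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(h) for c in range(w)]
    col_major = [grid[r][c] for c in range(w) for r in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def cross_merge(seqs, h, w):
    """Undo each scan order and sum the four sequences back onto an H x W grid."""
    row_major, col_major, row_rev, col_rev = seqs
    out = [[0.0] * w for _ in range(h)]
    # Row-major index i maps to (i // w, i % w).
    for i, v in enumerate(row_major):
        out[i // w][i % w] += v
    # Column-major index i maps to (i % h, i // h).
    for i, v in enumerate(col_major):
        out[i % h][i // h] += v
    # Reversed sequences are flipped back before applying the same mapping.
    for i, v in enumerate(row_rev[::-1]):
        out[i // w][i % w] += v
    for i, v in enumerate(col_rev[::-1]):
        out[i % h][i // h] += v
    return out
```

In a full model, each of the four sequences would pass through a sequence module before `cross_merge`. With the identity in place of that module, merging simply returns four times the input grid, which is a convenient sanity check on the index mapping.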
