xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

Tianrun Chen, Chaotao Ding, Lanyun Zhu, Tao Xu, Deyi Ji, Yan Wang, Ying Zang, Zejian Li

TL;DR

This work introduces xLSTM-UNet, a UNet-inspired segmentation network that incorporates Vision-LSTM (xLSTM) blocks as backbones to capture long-range dependencies with linear computational complexity. By embedding xLSTM blocks throughout the encoder and leveraging skip connections, the model achieves superior 2D and 3D medical image segmentation across abdomen MRI, endoscopy, microscopy, and BraTS brain MRI datasets, outperforming CNN-, Transformer-, and Mamba-based baselines. The results demonstrate that xLSTM-based backbones can set new benchmarks in medical segmentation, offering improved accuracy and efficiency, with code and models publicly available. The study also discusses limitations, such as dataset size and hardware optimization needs, and suggests directions for scaling and broader application of xLSTM architectures in medical imaging.

Abstract

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have been pivotal in biomedical image segmentation, yet their ability to model long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report we propose xLSTM-UNet, a UNet-structured deep neural network that uses Vision-LSTM (xLSTM) as its backbone for medical image segmentation. xLSTM was recently proposed as the successor to Long Short-Term Memory (LSTM) networks and has demonstrated superior performance compared to Transformers and State Space Models (SSMs) such as Mamba in Natural Language Processing (NLP) and in image classification (the latter as demonstrated by the Vision-LSTM, or ViL, implementation). Here, xLSTM-UNet extends that success to the biomedical image segmentation domain. By integrating the local feature extraction strengths of convolutional layers with the long-range dependency modeling of xLSTM, xLSTM-UNet offers a robust solution for comprehensive image analysis. We validate the efficacy of xLSTM-UNet experimentally. Our findings demonstrate that xLSTM-UNet consistently outperforms leading CNN-based, Transformer-based, and Mamba-based segmentation networks on multiple biomedical segmentation datasets, including organs in abdomen MRI, instruments in endoscopic images, and cells in microscopic images. This technical report highlights the potential of xLSTM-based architectures for advancing biomedical image analysis in both 2D and 3D. The code, models, and datasets are publicly available at http://tianrun-chen.github.io/xLSTM-UNet/
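
To make the idea of pairing local features with a linear-cost sequence model more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and not the xLSTM cell: `gated_scan` is a hypothetical, generic gated recurrence used as a stand-in, and the function names, the row-major scan order, and the residual connection are all assumptions made for illustration. It only shows the general pattern of flattening a feature map into a token sequence, mixing it with one recurrent update per token (so the cost grows linearly with the number of tokens), and reshaping it back to a grid.

```python
from math import exp


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + exp(-x))


def flatten_feature_map(fmap):
    """Row-major flatten of an H x W grid of feature vectors into H*W tokens."""
    return [tok for row in fmap for tok in row]


def unflatten(seq, height, width):
    """Inverse of flatten_feature_map for a known grid size."""
    return [seq[r * width:(r + 1) * width] for r in range(height)]


def gated_scan(tokens, forget_bias=1.0):
    """Generic gated recurrence (a stand-in, NOT the xLSTM cell).

    Performs exactly one state update per token, so the number of
    updates is linear in the sequence length.
    """
    state = [0.0] * len(tokens[0])
    outputs = []
    steps = 0
    for tok in tokens:
        new_state = []
        for s, x in zip(state, tok):
            f = sigmoid(forget_bias + x)  # hypothetical input-dependent forget gate
            new_state.append(f * s + (1.0 - f) * x)
        state = new_state
        outputs.append(list(state))
        steps += 1
    return outputs, steps


def sequence_block(fmap):
    """Flatten -> recurrent mixing -> residual add -> reshape back to a grid."""
    height, width = len(fmap), len(fmap[0])
    tokens = flatten_feature_map(fmap)
    mixed, steps = gated_scan(tokens)
    out = [[m + t for m, t in zip(ms, ts)] for ms, ts in zip(mixed, tokens)]
    return unflatten(out, height, width), steps


if __name__ == "__main__":
    # A 2 x 3 grid with 1-dimensional features; only the first position is non-zero.
    grid = [[[1.0], [0.0], [0.0]], [[0.0], [0.0], [0.0]]]
    out, steps = sequence_block(grid)
    print(steps)           # number of recurrent updates equals H * W
    print(out[1][2][0] > 0)  # the last position still carries signal from the first
```

In this sketch the recurrent state is what lets the last grid position see information from the first one, while the per-token update keeps the work proportional to the number of positions. In a UNet-style network, a block like this would sit after convolutional layers that produce the feature map; how the actual model arranges and parameterizes these blocks is described in the paper and its released code.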

Paper Structure

This paper contains 11 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The framework of the proposed method.
  • Figure 2: Visualized examples of 2D medical segmentation. xLSTM-UNet demonstrates greater robustness to heterogeneous appearances and exhibits fewer segmentation errors.