TK-Mamba: Marrying KAN With Mamba for Text-Driven 3D Medical Image Segmentation

Haoyu Yang, Yutong Guan, Meixing Shi, Yuxiang Cai, Jintao Chen, Sun Bing, Wenhui Lei, Mianxin Liu, Xiaoming Shi, Yankai Jiang, Jianwei Yin

TL;DR

This work tackles 3D medical image segmentation by addressing both computational efficiency and semantic robustness. It introduces TK-Mamba, a hybrid backbone that combines the linear-time Mamba with a novel 3D-GR-KAN nonlinear refiner, and a dual-branch text-driven mechanism built on PubMedCLIP embeddings to capture inter-organ semantics and align image features with anatomical descriptions. Key contributions include the 3D-GR-KAN module for data-adaptive nonlinear refinement and a two-branch text strategy that enhances segmentation accuracy for organs and tumors while mitigating label inconsistencies. Empirically, TK-Mamba achieves state-of-the-art performance on MSD and KiTS23 datasets, offering a favorable balance between Dice/NSD accuracy and computational cost, with strong single-organ and multi-organ results and comprehensive ablations validating design choices.

Abstract

3D medical image segmentation is important for clinical diagnosis and treatment but faces challenges from high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. To alleviate these limitations, we propose TK-Mamba, a multimodal framework that fuses the linear-time Mamba with Kolmogorov-Arnold Networks (KAN) to form an efficient hybrid backbone. Our approach makes two main technical contributions. First, we introduce the 3D-Group-Rational KAN (3D-GR-KAN), the first application of KAN in 3D medical imaging, which provides a computationally efficient nonlinear feature transformation suited to complex volumetric structures. Second, we devise a dual-branch text-driven strategy using PubMedCLIP embeddings. This strategy improves segmentation robustness and accuracy by capturing inter-organ semantic relationships to mitigate label inconsistencies while aligning image features with anatomical texts. By combining this backbone with vision-language knowledge, TK-Mamba offers a unified and scalable solution for both multi-organ and tumor segmentation. Experiments on multiple datasets demonstrate that our framework achieves state-of-the-art performance in both organ and tumor segmentation tasks, surpassing existing methods in both accuracy and efficiency. Our code is publicly available at https://github.com/yhy-whu/TK-Mamba
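
To make the "group-rational" idea concrete, the following is a minimal NumPy sketch of a rational activation whose coefficients are shared within channel groups of a 3D feature volume. It is not taken from the paper's code: the safe Padé-style form P(x) / (1 + |Q(x)|), the function name `group_rational`, and the tensor layout `(C, D, H, W)` are all assumptions made for illustration.

```python
import numpy as np

def group_rational(x, num_coeffs, den_coeffs, groups):
    """Apply a learnable-style rational function per channel group (illustrative).

    x:          feature volume of shape (C, D, H, W)
    num_coeffs: numerator coefficients a_0..a_m, shape (groups, m + 1)
    den_coeffs: denominator coefficients b_1..b_n, shape (groups, n)
    Assumed form: (a_0 + a_1 x + ... + a_m x^m) / (1 + |b_1 x + ... + b_n x^n|)
    """
    channels = x.shape[0]
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    per_group = channels // groups

    # Repeat each group's coefficients so every channel gets its group's row.
    a = np.repeat(num_coeffs, per_group, axis=0)  # (C, m + 1)
    b = np.repeat(den_coeffs, per_group, axis=0)  # (C, n)

    xe = x[..., None]                                  # (C, D, H, W, 1)
    num_powers = xe ** np.arange(a.shape[1])           # x^0 .. x^m
    den_powers = xe ** np.arange(1, b.shape[1] + 1)    # x^1 .. x^n

    numerator = (num_powers * a[:, None, None, None, :]).sum(axis=-1)
    # The absolute value keeps the denominator >= 1, so there are no poles.
    denominator = 1.0 + np.abs((den_powers * b[:, None, None, None, :]).sum(axis=-1))
    return numerator / denominator
```

Sharing coefficients per group rather than per channel keeps the parameter count small, which is one plausible reading of how such a module can stay cheap on volumetric features.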

Paper Structure

This paper contains 27 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Visualization of learned KAN splines (red) and the fixed GELU (black) for different channels in Stage 3. Each function is overlaid on its corresponding pre-activation value distribution (blue). The learned splines exhibit diverse, data-adaptive shapes that clearly diverge from the fixed baseline.
  • Figure 2: Overview of the TK-Mamba framework. Visual feature extraction starts with standardized preprocessing and a stem, followed by the K-Mamba Module, which comprises Gated Spatial Convolution (GSC), Tri-oriented Mamba (ToM), and 3D-GR-KAN components. The dual-branch text-driven strategy leverages PubMedCLIP for semantic enhancement and aligns features using a text-image contrastive loss. Features are fused through adaptive average pooling and an MLP to produce the segmentation mask, which is supervised by a combination of Dice loss, BCE loss, and contrastive loss.
  • Figure 3: Qualitative comparison of segmentation results on the KiTS23 and MSD datasets. Each row corresponds to a task (Liver, Lung, Pancreas, Hepatic Vessel, Colon, KiTS23).
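
The supervision described in the Figure 2 caption (Dice, BCE, and a text-image contrastive term) can be sketched roughly as follows. This is an illustrative NumPy version only: the InfoNCE-style form of the contrastive term, the temperature of 0.07, the equal default weights, and all function names are assumptions, since this section does not specify them.

```python
import numpy as np

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss over the whole volume; probs and target share a shape."""
    intersection = (probs * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)

def bce_loss(probs, target, eps=1e-7):
    """Binary cross-entropy on predicted probabilities (clipped for stability)."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """InfoNCE-style loss (assumed form): row i of img_feats matches row i of txt_feats."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def total_loss(probs, target, img_feats, txt_feats,
               w_dice=1.0, w_bce=1.0, w_con=1.0):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    return (w_dice * dice_loss(probs, target)
            + w_bce * bce_loss(probs, target)
            + w_con * contrastive_loss(img_feats, txt_feats))
```

In this sketch the segmentation terms act on voxel probabilities while the contrastive term acts on pooled per-class features, so the two kinds of supervision can be weighted independently.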