TK-Mamba: Marrying KAN With Mamba for Text-Driven 3D Medical Image Segmentation
Haoyu Yang, Yutong Guan, Meixing Shi, Yuxiang Cai, Jintao Chen, Sun Bing, Wenhui Lei, Mianxin Liu, Xiaoming Shi, Yankai Jiang, Jianwei Yin
TL;DR
This work tackles 3D medical image segmentation by addressing both computational efficiency and semantic robustness. It introduces TK-Mamba, which pairs a hybrid backbone (the linear-time Mamba combined with a novel 3D-GR-KAN nonlinear refiner) with a dual-branch text-driven mechanism built on PubMedCLIP embeddings to capture inter-organ semantics and align image features with anatomical descriptions. Key contributions include the 3D-GR-KAN module for data-adaptive nonlinear refinement and a two-branch text strategy that enhances segmentation accuracy for organs and tumors while mitigating label inconsistencies. Empirically, TK-Mamba achieves state-of-the-art performance on the MSD and KiTS23 datasets, offering a favorable balance between Dice/NSD accuracy and computational cost, with strong single-organ and multi-organ results and comprehensive ablations validating the design choices.
Abstract
3D medical image segmentation is important for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. To alleviate these limitations, we propose TK-Mamba, a multimodal framework that fuses the linear-time Mamba with Kolmogorov-Arnold Networks (KAN) to form an efficient hybrid backbone. Our approach makes two primary technical contributions. First, we introduce the novel 3D-Group-Rational KAN (3D-GR-KAN), which marks the first application of KAN in 3D medical imaging and provides a superior and computationally efficient nonlinear feature transformation that is crucial for complex volumetric structures. Second, we devise a dual-branch text-driven strategy using PubMedCLIP embeddings. This strategy significantly enhances segmentation robustness and accuracy by capturing inter-organ semantic relationships to mitigate label inconsistencies while simultaneously aligning image features with anatomical texts. By combining this advanced backbone with vision-language knowledge, TK-Mamba offers a unified and scalable solution for both multi-organ and tumor segmentation. Experiments on multiple datasets demonstrate that our framework achieves state-of-the-art performance in both organ and tumor segmentation tasks, surpassing existing methods in accuracy and efficiency. Our code is publicly available at https://github.com/yhy-whu/TK-Mamba
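To make the "group-rational" idea behind 3D-GR-KAN more concrete, the following is a minimal sketch of a group-wise rational activation applied to the channel vector of a single voxel. It is not the authors' implementation: the abstract does not specify the 3D-GR-KAN formulation, so the sketch assumes the commonly used safe rational form P(x) / (1 + |Q(x)|) with one coefficient set shared by all channels in a group. The function names `rational` and `group_rational`, and all coefficient values, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def rational(x: float, num: Sequence[float], den: Sequence[float]) -> float:
    """Assumed safe rational form P(x) / (1 + |Q(x)|).

    num holds a_0..a_m for P(x) = a_0 + a_1*x + ... + a_m*x^m.
    den holds b_1..b_n for Q(x) = b_1*x + ... + b_n*x^n (no constant term),
    so the denominator is always at least 1 and never crosses zero.
    """
    p = sum(a * x ** i for i, a in enumerate(num))
    q = sum(b * x ** (j + 1) for j, b in enumerate(den))
    return p / (1.0 + abs(q))


def group_rational(
    channels: Sequence[float],
    num_per_group: Sequence[Sequence[float]],
    den_per_group: Sequence[Sequence[float]],
) -> List[float]:
    """Apply one shared rational function to each contiguous channel group.

    Sharing coefficients within a group (an assumption of this sketch) keeps
    the parameter count independent of the number of channels per group.
    """
    groups = len(num_per_group)
    if groups == 0 or len(den_per_group) != groups:
        raise ValueError("need one numerator and one denominator set per group")
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    size = len(channels) // groups
    return [
        rational(v, num_per_group[i // size], den_per_group[i // size])
        for i, v in enumerate(channels)
    ]


# Hypothetical example: 4 channels of one voxel, split into 2 groups.
voxel_features = [1.0, 2.0, 3.0, 4.0]
numerators = [[0.0, 1.0], [1.0, 0.0, 1.0]]  # P0(x) = x, P1(x) = 1 + x^2
denominators = [[1.0], [0.0]]               # Q0(x) = x, Q1(x) = 0
activated = group_rational(voxel_features, numerators, denominators)
```

In a trained network the coefficients would be learned parameters and the function would be applied at every voxel of a 3D feature map; how TK-Mamba combines this refinement with its Mamba layers is not stated in this section and is left out of the sketch.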
