UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

Taixi Chen, Jingyun Chen, Nancy Guo

TL;DR

The work targets cell-level radiomics on H&E images, for which no dedicated backbone exists. It proposes Unified Attention-Mamba (UAM), a flexible backbone with Amamba and Amamba-MoE encoders that integrate Mamba-derived context with cross-attention and mixture-of-experts (MoE) fusion. A multimodal extension fuses radiomics embeddings with BiomedParse image features for joint cell classification and segmentation. On public benchmarks, UAM achieves state-of-the-art performance, including cell classification accuracy of up to 92.06% on certain datasets and segmentation gains over image-based baselines, demonstrating its potential for radiomics-driven cancer diagnosis.

Abstract

Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no backbone designed specifically for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$ = 349,882 cells) and tumor segmentation precision from 75% to 80% ($n$ = 406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

Paper Structure

This paper contains 18 sections, 12 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overall pipeline. Using ClinSegAI, high-throughput cellular radiomics features are extracted directly from the source images. These features are then processed by the UAM model, which generates cell-level diagnostic predictions.
  • Figure 2: Overall architecture of the proposed Unified Attention-Mamba (UAM) block. Unlike Jamba, UAM integrates normalization, Amamba, and Amamba-MoE layers without fixed ratio constraints, enabling flexible fusion of attention and Mamba mechanisms. Specifically, the Amamba layer leverages Mamba to generate cross-attention values, efficiently enhancing long-range dependency modeling. Meanwhile, the Amamba-MoE layer concatenates Mamba and self-attention outputs within a mixture-of-experts (MoE) framework, providing a comprehensive, multi-perspective representation of radiomics information for advanced processing.
  • Figure 3: Overview of the multimodal model architecture. The proposed model leverages radiomics data for tumor cell classification and integrates cell radiomics, image, and prompt information for effective tumor segmentation. Cell radiomics embeddings are projected into the image embedding space to enable seamless multimodal fusion. A pretrained BiomedParse encoder is employed for joint image–text feature extraction, while its decoder generates segmentation masks based on the concatenated multimodal embeddings.
  • Figure 4: Comparison with image-based SOTA models for cell classification on the IGNITE and WSSS4LUAD datasets. **: $p < 0.01$ (two-sample t-test).
  • Figure 5: Visual comparison of ground truth and UAM predictions on the IGNITE dataset. Tumor cells are highlighted in green.
  • ...and 6 more figures
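
The Figure 2 caption says the Amamba layer uses Mamba to generate the values consumed by cross-attention. The following is a minimal, illustrative sketch of that idea only, not the authors' implementation. It rests on several assumptions that do not come from the paper: queries and keys are taken from the raw input tokens, the Mamba block is replaced by a toy fixed-decay linear recurrence (`toy_scan`, which is not a selective state-space model), and all function names, shapes, and weights are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def toy_scan(x, decay=0.9):
    """Toy stand-in for a Mamba block (assumption): h_t = decay * h_{t-1} + x_t."""
    h = [0.0] * len(x[0])
    out = []
    for row in x:
        h = [decay * hi + xi for hi, xi in zip(h, row)]
        out.append(h)
    return out


def amamba_like_layer(x, w_q, w_k, w_v):
    """Attention whose values come from the recurrent branch (hypothetical layout)."""
    q = matmul(x, w_q)            # queries from the raw tokens (assumption)
    k = matmul(x, w_k)            # keys from the raw tokens (assumption)
    v = matmul(toy_scan(x), w_v)  # values from the Mamba stand-in
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


random.seed(0)
tokens, dim = 4, 3  # hypothetical: 4 radiomics tokens, 3 channels each
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]


def rand_w():
    """Random square projection matrix, for illustration only."""
    return [[random.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)] for _ in range(dim)]


out, weights = amamba_like_layer(x, rand_w(), rand_w(), rand_w())
print(len(out), len(out[0]))
```

In this sketch each output token is a weighted mix of recurrently accumulated context rather than of per-token projections, which is one way to read "Mamba-generated cross-attention values". How the actual layer derives its queries and keys is not stated in the caption.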

Theorems & Definitions (1)

  • Remark 1