CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction

Rabeya Tus Sadia; Qiang Ye; Qiang Cheng

TL;DR

CrossLLM-Mamba is a framework that reformulates RNA interaction prediction as a state-space alignment problem. Its results across RNA-protein, RNA-small molecule, and RNA-RNA benchmarks establish state-space modeling as a powerful paradigm for multi-modal biological interaction prediction.

Abstract

Accurate prediction of RNA-associated interactions is essential for understanding cellular regulation and advancing drug discovery. While Biological Large Language Models (BioLLMs) such as ESM-2 and RiNALMo provide powerful sequence representations, existing methods rely on static fusion strategies that fail to capture the dynamic, context-dependent nature of molecular binding. We introduce CrossLLM-Mamba, a novel framework that reformulates interaction prediction as a state-space alignment problem. By leveraging bidirectional Mamba encoders, our approach enables deep "crosstalk" between modality-specific embeddings through hidden state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps. The framework maintains linear computational complexity, making it scalable to high-dimensional BioLLM embeddings. We further incorporate Gaussian noise injection and Focal Loss to enhance robustness against hard-negative samples. Comprehensive experiments across three interaction categories (RNA-protein, RNA-small molecule, and RNA-RNA) demonstrate that CrossLLM-Mamba achieves state-of-the-art performance. On the RPI1460 benchmark, our model attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, we achieve Pearson correlations exceeding 0.95 on riboswitch and repeat RNA subtypes. These results establish state-space modeling as a powerful paradigm for multi-modal biological interaction prediction.
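
The abstract names Focal Loss as the mechanism for handling hard-negative samples. As a minimal sketch of the general idea (the standard binary focal loss, not necessarily the exact variant or hyperparameters the authors use), the loss down-weights examples the model already classifies confidently. The function name `focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over a batch.

    probs  : predicted interaction probabilities in [0, 1]
    labels : ground-truth labels, 0 (no interaction) or 1 (interaction)
    gamma  : focusing parameter; gamma = 0 removes the focusing term
    alpha  : weight on the positive class (assumed default, not from the paper)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
        # (1 - p_t)^gamma shrinks the loss of well-classified examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confidently correct prediction contributes far less than a hard one.
easy = focal_loss([0.9], [1])
hard = focal_loss([0.1], [1])
print(easy, hard)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check when tuning the focusing parameter.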

Paper Structure (39 sections, 10 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
RNA-Protein Interaction Prediction
RNA-RNA Interaction Prediction
RNA-Small Molecule Interaction Prediction
Biological Large Language Models
Multi-Modal Fusion Strategies
State Space Models in Computational Biology
Summary and Positioning
Overview of CrossLLM-Mamba
Language Model Embeddings for Biological Modalities
RNA Language Model Embeddings
Protein Language Model Embeddings
Small Molecule Language Model Embeddings
Robust Feature Alignment via Noise Injection
...and 24 more sections

Figures (5)

Figure 1: Sequence Embedding and Feature Extraction Pipeline. The proposed framework utilizes specialized pre-trained large language models to encode biological entities into high-dimensional feature vectors. (Top) Protein amino acid sequences are encoded using ESM-2. (Middle) RNA nucleotide sequences are processed via RiNALMo. (Bottom) Small molecule graphs are constructed and encoded using MoleBERT. These distinct embedding streams serve as the initial input features for the downstream dual-path Mamba architecture.
Figure 2: The CrossLLM-Mamba Model Architecture. The framework processes multi-modal inputs (Protein, RNA, or Molecule feature vectors) through a dual-path pipeline. First, feature vectors are projected and aligned using a linear transformation with Gaussian noise injection ($\mathcal{N}(0, \sigma^2)$) to enhance robustness. These aligned features are encoded by parallel BiMamba Encoders (detailed in the right panel), which capture bidirectional sequential dependencies. The encoded representations are fused in the Cross-Mamba Interaction Module via sequence stacking and a BiMamba Mixer to explicitly model interaction flows. Finally, global average pooling aggregates the features for the MLP prediction head to output the interaction probability.
Figure 3: Performance Comparison on the RPI1460 Dataset. The boxplots illustrate the distribution of performance metrics (MCC, ACC, F1, Precision, Recall, and Specificity) across 5-fold cross-validation for various state-of-the-art methods. Our proposed CrossLLM-Mamba (shown in red) consistently outperforms existing baselines, achieving the highest median scores and lowest variance across all major metrics, particularly in MCC and F1-score, demonstrating its robustness and superior predictive capability.
Figure 4: Performance impact of removing specific architectural components. The full CrossMamba-Bio model (blue) significantly outperforms variants lacking the cross-modal state mixing or bidirectional context.
Figure 5: Ablation study on the number of BiMamba blocks in the modality-specific encoders and fusion module. Performance peaks at moderate depth (3 encoder blocks, 2 to 3 fusion blocks), while deeper stacks show diminishing returns.
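
The data flow described in the Figure 2 caption (linear projection with Gaussian noise, parallel bidirectional encoders, sequence stacking, a bidirectional mixer, global average pooling, and a prediction head) can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: a fixed diagonal linear recurrence stands in for the selective Mamba scan, each modality is assumed to arrive as a short sequence of token vectors, the head is a single logistic unit rather than an MLP, and every function name and parameter (`ssm_scan`, `a`, `b`, `head_w`, and so on) is hypothetical rather than taken from the paper.

```python
import math
import random


def project(seq, weight):
    """Linear projection of each token; weight is a list of output rows."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in seq]


def add_gaussian_noise(seq, sigma, rng):
    """Add N(0, sigma^2) noise to every feature (a training-time perturbation)."""
    return [[x + rng.gauss(0.0, sigma) for x in tok] for tok in seq]


def ssm_scan(seq, a=0.9, b=0.1):
    """Stand-in for a Mamba scan: h_t = a * h_{t-1} + b * x_t per channel."""
    h = [0.0] * len(seq[0])
    states = []
    for tok in seq:
        h = [a * hi + b * xi for hi, xi in zip(h, tok)]
        states.append(h)
    return states


def bidirectional_scan(seq, a=0.9, b=0.1):
    """Sum of a forward scan and a backward scan over the same sequence."""
    fwd = ssm_scan(seq, a, b)
    bwd = ssm_scan(seq[::-1], a, b)[::-1]
    return [[f + g for f, g in zip(ft, bt)] for ft, bt in zip(fwd, bwd)]


def global_average_pool(seq):
    """Average token vectors over the sequence axis."""
    n = len(seq)
    return [sum(tok[j] for tok in seq) / n for j in range(len(seq[0]))]


def predict_interaction(rna, partner, w_rna, w_partner, head_w,
                        head_b=0.0, sigma=0.0, rng=None):
    """Return an interaction probability for one (RNA, partner) pair."""
    rng = rng or random.Random(0)
    # 1. Project both modalities to a shared width, optionally with noise.
    rna, partner = project(rna, w_rna), project(partner, w_partner)
    if sigma > 0.0:
        rna = add_gaussian_noise(rna, sigma, rng)
        partner = add_gaussian_noise(partner, sigma, rng)
    # 2. Encode each modality with its own bidirectional scan.
    rna_enc, partner_enc = bidirectional_scan(rna), bidirectional_scan(partner)
    # 3. Stack along the sequence axis so one hidden state crosses both modalities.
    mixed = bidirectional_scan(rna_enc + partner_enc)
    # 4. Pool and apply a logistic head (an MLP in the caption).
    pooled = global_average_pool(mixed)
    logit = sum(w * x for w, x in zip(head_w, pooled)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))


identity = [[1.0, 0.0], [0.0, 1.0]]
rna_tokens = [[1.0, 0.0], [0.0, 1.0]]
protein_tokens = [[0.5, 0.5]]
print(predict_interaction(rna_tokens, protein_tokens, identity, identity, [1.0, -1.0]))
```

Because the mixer scans the stacked sequence in both directions, the state at every RNA position depends on the partner tokens and vice versa, which is the "crosstalk" the abstract refers to. Each scan visits every token once, so the cost grows linearly with sequence length.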
