TF-Mamba: Text-enhanced Fusion Mamba with Missing Modalities for Robust Multimodal Sentiment Analysis

Xiang Li, Xianfu Cheng, Dezhuang Miao, Xiaoming Zhang, Zhoujun Li

TL;DR

TF-Mamba targets robust multimodal sentiment analysis (MSA) under missing modalities by building text-dominant strategies into an efficient Mamba framework. It introduces three components: Text-aware Modality Enhancement (TME), which aligns and enhances non-text modalities; Text-based Context Mamba (TC-Mamba), which models intra-modal context; and Text-guided Query Mamba (TQ-Mamba), which performs text-guided cross-modal fusion. On MOSI, MOSEI, and SIMS, TF-Mamba is reported to be more robust than Transformer-based baselines while using fewer FLOPs and parameters. The work supports the practicality of linear-time, text-led fusion for robust MSA, provides a public implementation, and leaves the handling of real-world missing patterns and end-to-end optimization as future work.

Abstract

Multimodal Sentiment Analysis (MSA) with missing modalities has attracted increasing attention recently. While current Transformer-based methods leverage dense text information to maintain model robustness, their quadratic complexity hinders efficient long-range modeling and multimodal fusion. To this end, we propose a novel and efficient Text-enhanced Fusion Mamba (TF-Mamba) framework for robust MSA with missing modalities. Specifically, a Text-aware Modality Enhancement (TME) module aligns and enriches non-text modalities, while reconstructing the missing text semantics. Moreover, we develop Text-based Context Mamba (TC-Mamba) to capture intra-modal contextual dependencies under text collaboration. Finally, Text-guided Query Mamba (TQ-Mamba) queries text-guided multimodal information and learns joint representations for sentiment prediction. Extensive experiments on three MSA datasets demonstrate the effectiveness and efficiency of the proposed method under missing modality scenarios. Our code is available at https://github.com/codemous/TF-Mamba.
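
To make the efficiency argument concrete, the following is a minimal sketch of a linear-time state space recurrence, plus a bidirectional variant built with a temporal flip (the operation the Figure 2 caption calls F). It is not the authors' implementation: the function names, the scalar-per-step input, the fixed diagonal transition `a`, and the choice to share `a` between the two directions are all illustrative assumptions, and the input-dependent (selective) parameters that Mamba uses are omitted.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = <c, h_t>.

    `a` is a diagonal state transition (one value per state dimension).
    All parameters are fixed here; this is a simplification, not Mamba's
    selective parameterization.
    """
    h = [0.0] * len(a)
    y = []
    for x_t in x:  # single pass: cost grows linearly with sequence length
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


def bidirectional_scan(x: Sequence[float], a: Sequence[float],
                       b_fwd: Sequence[float], c_fwd: Sequence[float],
                       b_bwd: Sequence[float], c_bwd: Sequence[float]) -> List[float]:
    """Sum a forward scan and a backward scan that reuse the same `a`.

    The backward pass flips the sequence in time, scans it, and flips the
    result back so both outputs are aligned per time step. Sharing `a`
    across directions is an assumption made for this sketch.
    """
    forward = ssm_scan(x, a, b_fwd, c_fwd)
    backward = ssm_scan(x[::-1], a, b_bwd, c_bwd)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # An impulse at t=0 decays forward in time; the backward pass only
    # sees it at its own final step, so it contributes at t=0 alone.
    print(bidirectional_scan([1.0, 0.0, 0.0], [0.5], [1.0], [1.0], [1.0], [1.0]))
    # -> [2.0, 0.5, 0.25]
```

Each scan visits every time step once, so the work is proportional to sequence length times state size, in contrast to the pairwise token comparisons of self-attention that give Transformers their quadratic cost.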

Paper Structure

This paper contains 36 sections, 18 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of the TF-Mamba framework, which consists of three main components: Text-aware Modality Enhancement (TME), Text-based Context Mamba (TC-Mamba), and Text-guided Query Mamba (TQ-Mamba). Yellow blocks indicate the dominant role of the text modality in the training pipeline.
  • Figure 2: An illustration of TC-Mamba with text and visual inputs. Red dashed lines indicate shared state transition matrices across Bi-Mamba blocks. The symbol F denotes the temporal flip operation.
  • Figure 3: Performance trends of models under varying missing rates on MOSI, MOSEI, and SIMS datasets.
  • Figure 4: Model performance and complexity comparison during inference on MOSI dataset.
  • Figure 5: Effect of the regularization parameter $\lambda$ on F1 Score and Acc-7 on the MOSI dataset.
  • ...and 3 more figures