
DualKanbaFormer: An Efficient Selective Sparse Framework for Multimodal Aspect-based Sentiment Analysis

Adamu Lawan, Juhua Pu, Haruna Yunusa, Muhammad Lawan, Aliyu Umar, Adamu Sani Yahya, Mahmoud Basi

TL;DR

This work tackles Multimodal Aspect-based Sentiment Analysis (MABSA) by addressing the inefficiency of conventional attention in capturing global context. It introduces DualKanbaFormer, a parallel architecture with Textual KanbaFormer and Visual KanbaFormer modules that integrate Aspect-Driven Sparse Attention (ADSA), a Selective State Space Model (Mamba), Kolmogorov-Arnold Networks (KANs), and Dynamic Tanh (DyT), along with a multimodal gated fusion layer for dynamic cross-modal interaction. The approach achieves state-of-the-art results on Twitter-15 and Twitter-17, with ablations confirming the contributions of Mamba, ADSA, KANs, and gating mechanisms. The model’s combination of efficient global-semantic modeling and fine-grained aspect-specific processing holds promise for scalable, real-time MABSA applications across multimodal data.

Abstract

Multimodal Aspect-based Sentiment Analysis (MABSA) enhances sentiment detection by integrating textual data with complementary modalities, such as images, to provide a more refined and comprehensive understanding of sentiment. However, conventional attention mechanisms, despite strong benchmark results, are hindered by quadratic complexity, which limits their ability to fully capture global contextual dependencies and rich semantic information in both modalities. To address this limitation, we introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to dynamically balance coarse-grained aggregation and fine-grained selection for aspect-focused precision, preserving both global context awareness and local precision in textual and visual representations. Additionally, we utilize the Selective State Space Model (Mamba) to capture extensive global semantic information across both modalities. Furthermore, we replace traditional feed-forward networks and normalization with Kolmogorov-Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability. To facilitate the effective integration of textual and visual features, we design a multimodal gated fusion layer that dynamically optimizes inter-modality interactions, significantly enhancing the model's efficacy in MABSA tasks. Comprehensive experiments on two publicly available datasets reveal that DualKanbaFormer consistently outperforms several state-of-the-art (SOTA) models.

Paper Structure

This paper contains 30 sections, 47 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: DualKanbaFormer complete architecture
  • Figure 2: Effect of different numbers of DualKanbaFormer layers