New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis

Quy Hoang Nguyen, Minh-Van Truong Nguyen, Kiet Van Nguyen

TL;DR

This paper introduces ViMACSA, a Vietnamese dataset for multimodal aspect-category sentiment analysis (MACSA) with 4,876 text-image pairs and 14,618 fine-grained annotations in the hotel domain, and proposes Fine-Grained Cross-Modal Fusion (FCMF) to learn fine-grained intra- and inter-modality interactions. The framework fuses textual and visual cues through three components: an auxiliary sentence constructed from the aspect, context, image categories, and RoI categories; image-guided attention; and geometric RoI-aware attention. It predicts sentiment for six hotel aspects. Empirical results show that FCMF outperforms state-of-the-art baselines, achieving a macro-F1 of 79.73% on ViMACSA, and ablation confirms the importance of the auxiliary sentence and RoI geometry. The work provides a Vietnamese multimodal benchmark and a fusion approach that uses fine-grained image information to improve MACSA. It also examines language-specific challenges such as misspellings and abbreviations, and it opens the way to future cross-domain research.

Abstract

The emergence of multimodal data on social media platforms presents new opportunities to better understand user sentiments toward a given aspect. However, existing multimodal datasets for Aspect-Category Sentiment Analysis (ACSA) often focus on textual annotations, neglecting fine-grained information in images. Consequently, these datasets fail to fully exploit the richness inherent in multimodal data. To address this, we introduce a new Vietnamese multimodal dataset, named ViMACSA, which consists of 4,876 text-image pairs with 14,618 fine-grained annotations for both text and image in the hotel domain. Additionally, we propose a Fine-Grained Cross-Modal Fusion Framework (FCMF) that effectively learns both intra- and inter-modality interactions and then fuses this information to produce a unified multimodal representation. Experimental results show that our framework outperforms state-of-the-art (SOTA) models on the ViMACSA dataset, achieving the highest F1 score of 79.73%. We also explore characteristics and challenges in Vietnamese multimodal sentiment analysis, including misspellings, abbreviations, and the complexities of the Vietnamese language. This work contributes both a benchmark dataset and a new framework that leverages fine-grained multimodal information to improve multimodal aspect-category sentiment analysis. Our dataset is available for research purposes: https://github.com/hoangquy18/Multimodal-Aspect-Category-Sentiment-Analysis.
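
The TL;DR describes the reported 79.73% as a macro-F1. As a reminder of what that metric measures, the following is a minimal sketch of macro-averaged F1: per-class F1 scores are computed independently and then averaged with equal weight. The function name `macro_f1` and the example labels are hypothetical and used only for illustration; how the authors aggregate scores across aspects is not stated here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative sketch only)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical sentiment labels, not taken from ViMACSA.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, macro-F1 penalizes poor performance on rare classes more than accuracy or micro-F1 would.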

Paper Structure (37 sections, 24 equations, 14 figures, 8 tables, 2 algorithms)

Figures (14)

  • Figure 1: Examples of the Multimodal ACSA task in Vietnamese.
  • Figure 2: Three-stage annotation process for the ViMACSA dataset.
  • Figure 3: Cohen's Kappa Score of the training phase.
  • Figure 4: IoU Score of the training phase.
  • Figure 5: Distribution of 6 Aspect Categories in ViMACSA dataset.
  • ...and 9 more figures
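
Figures 3 and 4 track two agreement measures during annotator training: Cohen's Kappa and IoU. The sketch below shows the standard definitions of both, assuming Kappa is computed between two annotators' categorical labels and IoU between two axis-aligned boxes given as `(x1, y1, x2, y2)`. The function names, box format, and sample values are assumptions for illustration; the exact protocol used for ViMACSA is not described in this section.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical annotations, not taken from the dataset.
kappa = cohen_kappa(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Kappa corrects raw agreement for agreement expected by chance, which makes it suitable for the text labels, while IoU measures how closely two annotators' region boxes coincide.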