Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language

Fatema Tuj Johora Faria, Mukaffi Bin Moin, Md. Mahfuzur Rahman, Md Morshed Alam Shanto, Asif Iftekher Fahim, Md. Moinul Hoque

TL;DR

This work addresses the challenge of author intent classification in low-resource Bangla social media by introducing the Uddessho multimodal dataset and the MABIC framework, which fuses text and image features through Early and Late Fusion. Text-only models reach up to 64.53% accuracy, while multimodal fusion achieves 76.19%, marking an 11.66 percentage point improvement and establishing a new Bangla benchmark. The dataset comprises 3,048 posts across six intents and includes rigorous annotation with high inter-annotator agreement, enabling robust evaluation. The study advances Bangla NLP and multimodal understanding, offering a public resource, baseline results, and a roadmap for future explainable, domain-specific enhancements.

Abstract

As daily information sharing and acquisition on the Internet grow more popular, this paper introduces an approach for intent classification in the Bangla language, focusing on social media posts where individuals share their thoughts and opinions. The proposed method leverages multimodal data with particular emphasis on authorship identification, aiming to understand the underlying purpose behind textual content across varied user-generated posts. Current methods often struggle with low-resource languages like Bangla, particularly when author traits are closely linked with intent, as observed in social media posts. To address this, we present the Multimodal-based Author Bangla Intent Classification (MABIC) framework, which uses text and images to gain deeper insight into the conveyed intentions. We have created a dataset named "Uddessho," comprising 3,048 instances sourced from social media. Our methodology comprises two approaches, one for classifying textual intent and one for classifying multimodal author intent, the latter incorporating early fusion and late fusion techniques. In our experiments, the unimodal approach achieved an accuracy of 64.53% in interpreting Bangla textual intent. In contrast, our multimodal approach outperformed the unimodal methods, achieving an accuracy of 76.19%, an improvement of 11.66 percentage points. To the best of our knowledge, this is the first research work on multimodal author intent classification for social media posts in the low-resource Bangla language.
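
The reported gain is the difference between the two accuracies, so it is measured in percentage points rather than as a relative change. The short sketch below recomputes it from the two figures quoted above; the variable names are illustrative, and the relative figure is derived here for comparison and is not a number reported by the authors.

```python
# Accuracies quoted in the abstract, in percent.
unimodal_accuracy = 64.53    # best text-only result
multimodal_accuracy = 76.19  # best multimodal (fused) result

# Absolute gain: a difference of two percentages, i.e. percentage points.
absolute_gain = multimodal_accuracy - unimodal_accuracy

# Relative gain: the same difference expressed as a share of the baseline.
# Derived for comparison only; it is not a figure stated in the paper.
relative_gain = absolute_gain / unimodal_accuracy * 100

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f} % over the text-only baseline")
```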

Paper Structure

This paper contains 14 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The diagram illustrates the MABIC framework, which has two fusion approaches: Early Fusion and Late Fusion.
  • Figure 2: Error Analysis of MABIC framework results, illustrating both early fusion and late fusion techniques.
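
The difference between the two fusion approaches named in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders, feature sizes, weights, and the averaging rule used for late fusion are all assumptions made for illustration, and the random weights stand in for trained parameters. Only the count of six intent classes comes from the text above.

```python
import math
import random

NUM_INTENTS = 6  # six intent classes; labels here are plain indices


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights):
    """One score per class: dot product of the features with each weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def random_weights(rng, n_classes, n_features):
    """Hypothetical stand-in for trained classifier parameters."""
    return [[rng.uniform(-1, 1) for _ in range(n_features)]
            for _ in range(n_classes)]


def early_fusion(text_feat, image_feat, joint_weights):
    """Early fusion: concatenate modality features, then classify once."""
    fused = text_feat + image_feat  # list concatenation
    return softmax(linear(fused, joint_weights))


def late_fusion(text_feat, image_feat, text_weights, image_weights):
    """Late fusion: classify each modality separately, then combine.

    Averaging the two probability vectors is an assumed combination rule.
    """
    p_text = softmax(linear(text_feat, text_weights))
    p_image = softmax(linear(image_feat, image_weights))
    return [(a + b) / 2 for a, b in zip(p_text, p_image)]


rng = random.Random(0)
text_feat = [rng.uniform(-1, 1) for _ in range(4)]   # toy text embedding
image_feat = [rng.uniform(-1, 1) for _ in range(3)]  # toy image embedding

early_probs = early_fusion(
    text_feat, image_feat, random_weights(rng, NUM_INTENTS, 7)
)
late_probs = late_fusion(
    text_feat, image_feat,
    random_weights(rng, NUM_INTENTS, 4),
    random_weights(rng, NUM_INTENTS, 3),
)

print("early fusion predicts class", early_probs.index(max(early_probs)))
print("late fusion predicts class ", late_probs.index(max(late_probs)))
```

In the early variant a single classifier sees both modalities at once and can learn interactions between text and image features; in the late variant each modality is judged independently and only the decisions are merged.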