Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

Raihan Tanvir, Md. Golam Rabiul Alam

TL;DR

The paper develops a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduces RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. Together, the results demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.

Abstract

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: A sample meme from the Bengali Hateful Memes dataset, showcasing nuanced visual and textual elements that convey hateful sentiment.
  • Figure 2: Overview of the xDORA architecture. Visual and textual features are extracted using pretrained encoders. A dual co-attention module with multi-head attention integrates the modalities through two cross-attention flows: I2T-ACT (Image-to-Text Attention Conditioned on Text, Equation \ref{eq:attn1}) and I2I-ACT (Image-to-Image Attention Conditioned on Text, Equation \ref{eq:attn2}). The fused representation is aggregated via weighted attention pooling and fed to a multilayer perceptron for classification.
  • Figure 3: Architecture of RAG-Fused DORA. A multimodal embedding $\mathbf{Z}$ is obtained using xDORA as a feature extractor. This embedding is used in two parallel branches: (1) it is passed through an MLP classifier to produce prediction scores; and (2) it is used to retrieve the top-$k$ nearest neighbors from a FAISS index based on cosine similarity ($\text{Top-}k[\cos(\mathbf{Z}, \mathbf{Z}_{\text{db}})]$). The retrieved labels are aggregated using similarity-weighted label distribution. Final predictions are computed via a weighted ensemble of the classifier logits and the FAISS-derived label distribution.
  • Figure 4: Illustration of the RAG-Prompted LLaVA framework. Given a test image, top-$k$ semantically similar exemplars per class are retrieved from a FAISS index using xDORA-generated embeddings. The retrieved text-label pairs are formatted as few-shot exemplars and incorporated into a prompt alongside the test caption. This retrieval-augmented prompt is then fed to LLaVA, enabling few-shot multimodal classification.
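
The retrieval-and-fusion step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a brute-force cosine search stands in for the FAISS index, the classifier logits are passed through a softmax before mixing, and the mixing weight `alpha` and every function name are hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_top_k(z, db_embeddings, k):
    """Brute-force stand-in for a FAISS search.

    Returns (similarity, index) pairs for the k database embeddings
    most similar to the query embedding z.
    """
    scored = sorted(
        ((cosine(z, e), i) for i, e in enumerate(db_embeddings)),
        reverse=True,
    )
    return scored[:k]


def neighbor_label_distribution(neighbors, db_labels, num_classes):
    """Similarity-weighted label distribution over the retrieved neighbors."""
    dist = [0.0] * num_classes
    for sim, idx in neighbors:
        # Assumption: negative similarities contribute no weight.
        dist[db_labels[idx]] += max(sim, 0.0)
    total = sum(dist)
    if total == 0.0:
        return [1.0 / num_classes] * num_classes
    return [d / total for d in dist]


def fuse(logits, retrieval_dist, alpha=0.5):
    """Weighted ensemble of classifier output and retrieval distribution.

    Assumption: logits are converted to probabilities first, and alpha
    is a hypothetical mixing weight (the caption does not give a value).
    """
    probs = softmax(logits)
    return [alpha * p + (1.0 - alpha) * r for p, r in zip(probs, retrieval_dist)]


# Toy data: three stored embeddings with labels (1 = hateful, 0 = not hateful).
db_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
db_labels = [1, 1, 0]

z = [1.0, 0.05]      # query embedding, standing in for the xDORA output Z
logits = [0.2, 0.1]  # classifier logits for the same sample

neighbors = retrieve_top_k(z, db_embeddings, k=2)
retrieval_dist = neighbor_label_distribution(neighbors, db_labels, num_classes=2)
fused = fuse(logits, retrieval_dist, alpha=0.5)
prediction = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the classifier alone slightly favors class 0, while both retrieved neighbors carry label 1, so the fused prediction follows the neighbors. With a real FAISS index, cosine similarity is commonly obtained by L2-normalizing the embeddings and searching with an inner-product index.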