Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection

Eftekhar Hossain; Omar Sharif; Mohammed Moshiul Hoque; Sarah M. Preum

TL;DR

This work addresses multimodal hateful meme detection by introducing MCA-SCF, a context-aware architecture that aligns visual and textual features via an attention mechanism before fusion. The model uses a ResNet50-based visual encoder and a BiLSTM-based textual encoder, with Bahdanau-style alignment producing context vectors that form a context-rich multimodal representation $M_{sf}$. Evaluated on MUTE (Bangla code-mixed) and MultiOFF (English), MCA-SCF achieves state-of-the-art F1 scores of $0.697$ and $0.703$, respectively, outperforming baselines by up to $3.2$ percentage points. Ablation and error analyses indicate that while contextualized embeddings provide limited gains, the alignment strategy substantially improves cross-language hateful meme detection, demonstrating strong generalization and practical potential for multilingual deployment.
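The alignment step described above can be illustrated with a minimal, dependency-free sketch of Bahdanau-style (additive) attention, in which a single visual feature vector scores each textual hidden state. This is not the authors' implementation: the function names, the parameters `W_v`, `W_h`, and `w`, the toy dimensions, and the random inputs are all assumptions made for illustration. The sketch computes only a textual context vector, because this summary does not specify how the visual context vector is formed, so the fused vector below omits it.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def additive_alignment(v_f, H, W_v, W_h, w):
    """Additive attention of a visual vector v_f over textual states H.

    v_f: visual feature vector, length d_v
    H:   list of l textual hidden states, each of length d_h
    W_v: (d_a x d_v) projection, W_h: (d_a x d_h) projection, w: length d_a
    Returns the alignment weights (length l) and the textual context vector.
    """
    q = matvec(W_v, v_f)  # projected visual feature, shared across positions
    scores = []
    for h_j in H:
        k = matvec(W_h, h_j)  # projected hidden state for position j
        scores.append(sum(w_i * math.tanh(q_i + k_i)
                          for w_i, q_i, k_i in zip(w, q, k)))
    alpha = softmax(scores)
    d_h = len(H[0])
    # Textual context: attention-weighted sum of the hidden states.
    c_t = [sum(a * h_j[i] for a, h_j in zip(alpha, H)) for i in range(d_h)]
    return alpha, c_t


def rand_matrix(rng, rows, cols):
    """Hypothetical helper: random matrix standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
l, d_v, d_h, d_a = 6, 8, 4, 5  # toy sizes, not the paper's dimensions
v_f = [rng.uniform(-1.0, 1.0) for _ in range(d_v)]
H = rand_matrix(rng, l, d_h)
W_v = rand_matrix(rng, d_a, d_v)
W_h = rand_matrix(rng, d_a, d_h)
w = [rng.uniform(-1.0, 1.0) for _ in range(d_a)]

alpha, c_t = additive_alignment(v_f, H, W_v, W_h, w)
# Partial fused vector: textual context, visual feature, last hidden state.
fused = c_t + v_f + H[-1]
```

In a trained model, `v_f` would come from the ResNet50-based encoder and `H` from the BiLSTM, and the projections would be learned jointly with the classifier rather than sampled at random.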

Abstract

Multimodal hateful content detection is a challenging task that requires complex reasoning across visual and textual modalities. Creating a meaningful multimodal representation that effectively captures the interplay between visual and textual features through intermediate fusion is therefore critical. Conventional fusion techniques are unable to attend to the modality-specific features effectively. Moreover, most studies have concentrated exclusively on English and overlooked low-resource languages. This paper proposes a context-aware attention framework for multimodal hateful content detection and assesses it for both English and non-English languages. The proposed approach incorporates an attention layer to meaningfully align the visual and textual features. This alignment enables selective focus on modality-specific features before fusing them. We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English). Evaluation results demonstrate the effectiveness of the proposed approach, with F1-scores of $69.7$% and $70.3$% on the MUTE and MultiOFF datasets, respectively. These scores represent improvements of approximately $2.5$% and $3.2$%, respectively, over the state-of-the-art systems on these datasets. Our implementation is available at https://github.com/eftekhar-hossain/Bengali-Hateful-Memes.

Paper Structure (22 sections, 11 equations, 5 figures, 5 tables)

This paper contains 22 sections, 11 equations, 5 figures, 5 tables.

Introduction
Related Work
Method
Proposed (MCA-SCF) Architecture
Preprocessing
Visual and Textual Feature Extractor
Alignment and Fusion
Experiments and Results
Datasets
MUTE (Hossain et al., 2022)
MultiOFF (Suryawanshi et al., 2020)
Baselines
Unimodal Models
Multimodal Models
Results
...and 7 more sections

Figures (5)

Figure 1: Example of hateful memes. In isolation, neither the image nor the caption may appear hateful, but when combined, they can convey a hateful message.
Figure 2: Our proposed context-aware multimodal architecture: $v$ and $t$ are the processed image and its corresponding caption. The upper block represents the visual feature extractor, and the lower block is the textual feature extractor. Alignment scores ($\alpha_{yj}$) are calculated by applying attention on visual ($V_f$) and textual ($h_1...h_l$) features. Subsequently, visual ($C_v$) and textual ($C_t$) context vectors are created by aligning ($V_f$) and ($h_1...h_l$) through alignment vector ($\alpha_{yj}$). Finally, by concatenating these context vectors ($C_v, C_t$) with modality-specific features ($V_f$, $h_l$) our method creates the multimodal context-aware representation $M_{sf}$.
Figure 3: Misclassification rate comparison between various fusion approaches (i.e., early, late, attentive) and proposed (MCA-SCF) method on both datasets.
Figure 4: Example (a) shows a meme where the proposed method yields better predictions, and example (b) illustrates a wrongly classified sample. The symbols (✓) and (✗) indicate correct and incorrect predictions, respectively. EF and AF represent the early fusion and attentive fusion approaches, respectively.
Figure A.1: Variants of the proposed MCA-SCF framework. The majority of the components remain the same as illustrated in Figure 2. The three variants ($V_{gf}, M_{cf}, T_{gf}$) differ in the way they integrate information to emphasize the context of a particular modality.
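
The notation in the Figure 2 caption can be read against the standard additive (Bahdanau-style) attention formulation, written out here as a hedged sketch. The projection matrices $W_v$, $W_h$ and the vector $w$ are assumed names that do not appear in this summary, and the paper's exact equations may differ. Only the textual context vector $C_t$ is written out, since the caption does not state how $C_v$ is computed.

$$
\begin{aligned}
e_{yj} &= w^{\top} \tanh\left(W_v V_f + W_h h_j\right), \qquad j = 1, \dots, l \\
\alpha_{yj} &= \frac{\exp(e_{yj})}{\sum_{k=1}^{l} \exp(e_{yk})} \\
C_t &= \sum_{j=1}^{l} \alpha_{yj}\, h_j \\
M_{sf} &= \left[\, C_v \,;\, C_t \,;\, V_f \,;\, h_l \,\right]
\end{aligned}
$$

The last line restates the concatenation described in the caption. The ordering of the four components inside $M_{sf}$ is an assumption.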
