Human-Centered Multimodal Fusion for Sexism Detection in Memes with Eye-Tracking, Heart Rate, and EEG Signals

Iván Arcos, Paolo Rosso, Elena Gomis-Vicent

TL;DR

Results show that physiological responses provide an objective signal of perception that enhances the accuracy and human-awareness of automated systems for countering online sexism.

Abstract

The automated detection of sexism in memes is a challenging task due to multimodal ambiguity, cultural nuance, and the use of humor to provide plausible deniability. Content-only models often fail to capture the complexity of human perception. To address this limitation, we introduce and validate a human-centered paradigm that augments standard content features with physiological data. We created a novel resource by recording Eye-Tracking (ET), Heart Rate (HR), and Electroencephalography (EEG) from 16 subjects (8 per experiment) while they viewed 3984 memes from the EXIST 2025 dataset. Our statistical analysis reveals significant physiological differences in how subjects process sexist versus non-sexist content. Sexist memes were associated with higher cognitive load, reflected in increased fixation counts and longer reaction times, as well as differences in EEG spectral power across the Alpha, Beta, and Gamma bands, suggesting more demanding neural processing. Building on these findings, we propose a multimodal fusion model that integrates physiological signals with enriched textual-visual features derived from a Vision-Language Model (VLM). Our final model achieves an AUC of 0.794 in binary sexism detection, a statistically significant 3.4% improvement over a strong VLM-based baseline. The fusion is particularly effective for nuanced cases, boosting the F1-score for the most challenging fine-grained category, Misogyny and Non-Sexual Violence, by 26.3%. These results show that physiological responses provide an objective signal of perception that enhances the accuracy and human-awareness of automated systems for countering online sexism.
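
The band-level EEG comparison mentioned in the abstract rests on estimating spectral power per frequency band. The sketch below illustrates one simple way to do that with a plain periodogram, using only the Python standard library. The band edges, the function names `periodogram` and `band_powers`, and the choice of a periodogram are illustrative assumptions; the abstract does not state which estimator or band limits the authors used.

```python
import cmath
import math

# Conventional EEG band edges in Hz. These are common defaults and an
# assumption of this sketch, not values taken from the paper.
BANDS = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 45.0),
}


def periodogram(signal, fs):
    """Return (freqs, power) for the non-negative frequencies of a real signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        # Direct DFT coefficient for frequency bin k (O(n^2), fine for short windows).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_powers(signal, fs):
    """Mean periodogram power inside each band of BANDS (0.0 if no bin falls in it)."""
    freqs, power = periodogram(signal, fs)
    result = {}
    for name, (lo, hi) in BANDS.items():
        in_band = [p for f, p in zip(freqs, power) if lo <= f < hi]
        result[name] = sum(in_band) / len(in_band) if in_band else 0.0
    return result


# Hypothetical single-channel example: 2 s of a 10 Hz oscillation sampled at 128 Hz.
fs = 128.0
demo_signal = [math.sin(2 * math.pi * 10.0 * t / fs) for t in range(256)]
demo_powers = band_powers(demo_signal, fs)
```

For the synthetic 10 Hz input, the Alpha entry of `demo_powers` dominates the other bands. On real recordings a windowed, averaged estimator (for example Welch's method) would usually be preferred over a raw periodogram.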

Paper Structure (23 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: EEG topographic maps of band-power differences across key experimental contrasts. For each subfigure, the top row shows the mean power for the first condition and the middle row for the second; the bottom row shows their difference (Condition 2 – Condition 1). Specifically: (a) Non-Sexist vs. Sexist; (b) Judgmental vs. Direct Sexism; (c) Non-Objectification vs. Objectification; (d) Neutral vs. Fear (OCR-based emotion). Columns correspond to Delta, Theta, Alpha, Beta, and Gamma bands. Red indicates power increase, blue indicates power decrease; red stars mark channels with statistically significant differences ($p<0.05$).
  • Figure 2: Architecture of the final hierarchical attention-based fusion model. It integrates enriched text (OCR + VLM caption) with sequences of physiological reactions from several subjects (EEG and Eye-Tracking/HR). Cross-attention mechanisms allow the model to learn correlations between specific textual tokens and physiological responses.
  • Figure 3: Performance on Task 1 (Binary Sexism Detection) with 95% confidence intervals. Bars represent model performance scores (Macro F1, F1+, and AUC), showing a progressive and statistically significant improvement ($p < 0.05$) as all physiological signals are added.
  • Figure 4: Attention-based interpretation of a correctly classified sexist meme.
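
The cross-attention step described in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. It shows single-head scaled dot-product attention in which text-token vectors act as queries and physiological feature vectors act as keys and values. The function names, the toy vectors, and the absence of learned projection matrices are all simplifying assumptions; the caption does not specify the number of heads, the dimensions, or the layer layout of the authors' model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: list of text-token vectors, each of length d
    keys:    list of physiological vectors, each of length d
    values:  list of physiological vectors, one per key
    Returns (outputs, weights), where weights[i][j] is how strongly
    text token i attends to physiological step j.
    """
    d = len(queries[0])
    value_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output vector per text token.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(value_dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy inputs: 2 text tokens and 3 physiological time steps, d = 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
physio_keys = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
physio_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused, attn = cross_attention(text_tokens, physio_keys, physio_values)
```

Each row of `attn` sums to 1, and each text token places most of its weight on the physiological step whose key aligns with it. Attention weights of this kind are what make token-level interpretations such as the one in Figure 4 possible, although the paper's actual interpretation procedure may differ.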