Meme Trojan: Backdoor Attacks Against Hateful Meme Detection via Cross-Modal Triggers

Ruofei Wang, Hongzhan Lin, Ziyuan Luo, Ka Chun Cheung, Simon See, Jing Ma, Renjie Wan

TL;DR

This work addresses the security risk of backdoor attacks on hateful meme detection by introducing Meme Trojan, which leverages a Cross-Modal Trigger (CMT) and a Trigger Augmentor to activate backdoors across both visual and textual modalities. The CMT embeds a text-like pattern (e.g., the string "..") into meme content and uses OCR-based extraction to propagate the trigger to the text modality, with the Trigger Augmentor refining the trigger to reduce false activations. Extensive experiments on FBHM, MAMI, and HarMeme across multiple detectors show that CMT outperforms the prior multimodal backdoor TrojVQA and the CMT without augmentation, achieving higher attack success and stealth while remaining robust to Neural Polarizer defenses. The results underscore significant real-world risks for automated hateful meme detection and motivate the development of defenses, OCR improvements, and broader evaluations across datasets and defenses.

Abstract

Hateful meme detection aims to prevent the proliferation of hateful memes on various social media platforms. Considering its impact on social environments, this paper introduces a previously ignored but significant threat to hateful meme detection: backdoor attacks. By injecting specific triggers into meme samples, backdoor attackers can manipulate the detector to output their desired outcomes. To explore this, we propose the Meme Trojan framework to initiate backdoor attacks on hateful meme detection. Meme Trojan involves creating a novel Cross-Modal Trigger (CMT) and a learnable trigger augmentor to enhance the trigger pattern according to each input sample. Due to the cross-modal property, the proposed CMT can effectively initiate backdoor attacks on hateful meme detectors under an automatic application scenario. Additionally, the injection position and size of our triggers are adaptive to the texts contained in the meme, which ensures that the trigger is seamlessly integrated with the meme content. Our approach outperforms the state-of-the-art backdoor attack methods, showing significant improvements in effectiveness and stealthiness. We believe that this paper will draw more attention to the potential threat posed by backdoor attacks on hateful meme detection.
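
To make the cross-modal idea concrete, here is a minimal sketch of how a text-like ".." pattern might be blended into a meme image next to its caption, and how the same pattern could then surface in the OCR-extracted text. It is an illustration under stated assumptions, not the authors' implementation. The function names (`inject_dots`, `poison_text`), the linear blending rule with weight `lam`, the one-fifth dot-size ratio, and the placement to the right of the text box are all hypothetical choices. The learnable trigger augmentor is not modeled.

```python
import numpy as np


def inject_dots(image, text_box, lam=0.5, color=(255, 255, 255)):
    """Blend a two-dot '..' pattern into `image` just right of `text_box`.

    image:    H x W x 3 uint8 array.
    text_box: (x0, y0, x1, y1) pixel box of the meme caption (assumed known,
              e.g. from an OCR engine; obtaining it is outside this sketch).
    lam:      blending weight in [0, 1]; 0 leaves the image unchanged.
    """
    h, w, _ = image.shape
    x0, y0, x1, y1 = text_box
    # Assumption: dot size follows the caption height so the pattern
    # resembles punctuation belonging to the existing text.
    d = max(1, (y1 - y0) // 5)
    out = image.astype(np.float32)
    top = max(y1 - d, 0)  # sit the dots on the caption baseline
    for k in range(2):
        left = min(x1 + d * (1 + 2 * k), w - d)  # keep the dot inside the image
        region = out[top:top + d, left:left + d]  # view into `out`
        region[:] = (1.0 - lam) * region + lam * np.asarray(color, np.float32)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def poison_text(ocr_text, trigger=".."):
    """Mimic the text modality after OCR reads the injected pattern."""
    return f"{ocr_text} {trigger}"


if __name__ == "__main__":
    meme = np.zeros((100, 200, 3), dtype=np.uint8)  # stand-in for a real meme
    poisoned = inject_dots(meme, text_box=(10, 40, 110, 60), lam=0.5)
    print(poisoned.shape, poison_text("some caption"))
```

Tying the dot size and position to the caption box is what lets the pattern pass as ordinary punctuation, while a small `lam` trades trigger visibility against how reliably an OCR engine would pick it up.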

Paper Structure

This paper contains 51 sections, 3 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: (I) Memes possess a special property: combining the same text with different images, or vice versa, can convey opposite meanings. (II) Under backdoor attacks, the hateful meme detector accurately identifies benign samples but produces malicious results when it encounters specific triggers, resulting in the proliferation of hateful memes. Panels (a), (b), and (c) show poisoned samples from TrojVQA [walmer2022dual], our cross-modal trigger without the trigger augmentor, and our cross-modal trigger with the trigger augmentor, respectively. A detailed illustration of each meme is provided in the Supplementary Materials.
  • Figure 2: The framework of our Meme Trojan, including Cross-Modal Trigger (CMT) injection, backdoor model training, and backdoor model attacking.
  • Figure 3: Comparison between FIBA, consider-like pattern (BadNL), red pattern, random pattern, CMT w/o TA, and CMT.
  • Figure 4: An input meme poisoned by our CMT with different blending parameter $\lambda$.
  • Figure 5: Pipeline for generating new memes with a multimodal large language model (LLaVA [liu2024visual]) and an image generation model (diffusion model [rombach2022high]).
  • ...and 1 more figures