BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and Multilingual Exploration of Persuasion in Memes

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Lele Wang, Giuseppe Carenini

TL;DR

This work tackles the detection of rhetorical and psychological persuasion techniques in memes across multiple languages within SemEval-2024 Task 4. It introduces an intermediate meme caption generation step to bridge the modality gap between text and image, leveraging GPT-4 captions alongside meme text to fine-tune RoBERTa and CLIP, achieving strong gains across 12 subtasks and top placements in Subtask 2a/2b. The authors compare a wide range of models, with the ConcatRoBERTa architecture (image+text+caption) trained on GPT-4 captions delivering the best dev performance, highlighting the value of semantic information from generated captions for abstract visual semantics in memes. The study advances understanding of multimodal and multilingual persuasion detection and points to future work on improving image-based metaphor understanding and robustness to adversarial and domain-shift scenarios.

Abstract

Memes, combining text and images, frequently use metaphors to convey persuasive messages, shaping public opinion. Motivated by this, our team engaged in SemEval-2024 Task 4, a hierarchical multi-label classification task designed to identify rhetorical and psychological persuasion techniques embedded within memes. To tackle this problem, we introduced a caption generation step to assess the modality gap and the impact of additional semantic information from images, which improved our result. Our best model utilizes GPT-4 generated captions alongside meme text to fine-tune RoBERTa as the text encoder and CLIP as the image encoder. It outperforms the baseline by a large margin in all 12 subtasks. In particular, it ranked in top-3 across all languages in Subtask 2a, and top-4 in Subtask 2b, demonstrating quantitatively strong performance. The improvement achieved by the introduced intermediate step is likely attributable to the metaphorical essence of images that challenges visual encoders. This highlights the potential for improving abstract visual semantics encoding.

Paper Structure (22 sections, 3 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Supervised fine-tuning loop of the LLaVA-1.5-7B model on the MemeCap dataset for caption generation. The OCR module extracts text from the meme images. The vision encoder (CLIP), a frozen component of LLaVA-1.5-7B, processes the meme images. The vision-language projector bridges the gap between CLIP's representation and the embedding space of Vicuna. While CLIP remains frozen, the vision-language projector is fine-tuned. The Vicuna component was tested in both frozen and fine-tuned setups to generate captions.
  • Figure 2: Architecture of ConcatRoBERTa, our best-performing model. The GPT-4V(ision) component generates a descriptive caption of the meme image. The caption is then combined with the text written in the meme, and the combined text is processed by RoBERTa. The vision encoder uses a pre-trained vision transformer model (CLIP-ViT) to encode the visual elements of the meme. The MLP classifier takes the concatenated visual and textual representations and classifies the meme. RoBERTa and the MLP classifier are fine-tuned, while CLIP remains frozen.
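
The fusion step described for ConcatRoBERTa in Figure 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the text and image encoders have already produced fixed-size vectors (toy lists of floats stand in for RoBERTa and CLIP outputs here). The separator string, the hidden size, the sigmoid output with a 0.5 threshold, and the names `build_text_input`, `FusionMLP`, and `predict_labels` are all hypothetical choices made for the example.

```python
import math
import random


def build_text_input(meme_text, caption, sep=" </s> "):
    """Join the text written in the meme with the generated caption.

    The separator is an assumption; the source only says the two are combined.
    """
    return meme_text.strip() + sep + caption.strip()


def init_layer(n_in, n_out, rng):
    """Create a randomly initialised dense layer (weights plus zero bias)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Apply a dense layer: one dot product per output unit, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class FusionMLP:
    """Toy stand-in for the MLP classifier that sits on top of both encoders."""

    def __init__(self, text_dim, image_dim, hidden_dim, num_labels, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_layer(text_dim + image_dim, hidden_dim, rng)
        self.w2, self.b2 = init_layer(hidden_dim, num_labels, rng)

    def forward(self, text_vec, image_vec):
        # Late fusion: concatenate the textual and visual representations.
        fused = list(text_vec) + list(image_vec)
        hidden = [max(0.0, h) for h in linear(fused, self.w1, self.b1)]  # ReLU
        logits = linear(hidden, self.w2, self.b2)
        # One independent sigmoid per label (assumed multi-label output).
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_labels(probs, label_names, threshold=0.5):
    """Keep every label whose probability reaches the (assumed) threshold."""
    return [name for name, p in zip(label_names, probs) if p >= threshold]


# Toy usage: 4-dim "text" vector and 3-dim "image" vector, three placeholder labels.
head = FusionMLP(text_dim=4, image_dim=3, hidden_dim=5, num_labels=3)
probabilities = head.forward([0.1, -0.2, 0.3, 0.4], [0.5, 0.0, -0.1])
predicted = predict_labels(probabilities, ["technique_a", "technique_b", "technique_c"])
```

In a real setup, the image vector would come from the frozen CLIP encoder and the text vector from the fine-tuned RoBERTa encoder, so only the text encoder and the classifier head would receive gradient updates, as the caption states.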