A CLIP-based siamese approach for meme classification

Javier Huertas-Tato, Christos Koutlis, Symeon Papadopoulos, David Camacho, Ioannis Kompatsiaris

TL;DR

SimCLIP tackles cross-modal meme classification by freezing the CLIP encoders and applying a lightweight Siamese-style fusion that combines image and text embeddings via concatenation, absolute difference, and the Hadamard product. Evaluated on seven meme classification tasks across six datasets, it achieves state-of-the-art performance on Harm-P and Memotion7k Task 3, remains competitive on the others, and outperforms heavier baselines in several cases. The approach is intended as a compact, accessible baseline for meme analysis that uses CLIP's shared embedding space to model text-image interactions without external context or ensembles. Its limitations include generalization to in-the-wild memes and dataset-specific trade-offs, which point to future work on larger meme-focused pretraining and knowledge-graph integration.
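
Read literally, the fusion step described above can be written as a single expression. This is only a sketch: the symbols $p_I$ and $p_T$ (projected image and text embeddings) and $z$ (the fused vector) are names chosen here for illustration and may not match the paper's own notation.

$$z = p_I \oplus p_T \oplus \lvert p_I - p_T \rvert \oplus (p_I \odot p_T)$$

Here $\oplus$ denotes concatenation and $\odot$ the element-wise (Hadamard) product. Under this reading, if both projections have dimension $d$, the fused vector $z$ has dimension $4d$.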

Abstract

Memes are an increasingly prevalent element of online discourse in social networks, especially among young audiences. They carry ideas and messages that range from humorous to hateful, and are widely consumed. Their potentially high impact calls for adequate means of moderating their use at large scale. In this work, we propose SimCLIP, a deep learning-based architecture for cross-modal understanding of memes, which leverages a pre-trained CLIP encoder to produce context-aware embeddings and a Siamese fusion technique to capture the interactions between text and image. We perform extensive experiments on seven meme classification tasks across six datasets. We establish a new state of the art on Memotion7k with a 7.25% relative F1-score improvement, and achieve super-human performance on Harm-P with a 13.73% F1-score improvement. Our approach demonstrates the potential of compact meme classification models, enabling accurate and efficient meme monitoring. We share our code at https://github.com/jahuerta92/meme-classification-simclip

Paper Structure (12 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Six memes from the Memotion7k training set (chhavi2020memotion), illustrating the variety in format. From left to right: a) a comic-style template, completely harmless; b) a misogynistic image macro, not a common template but still requiring context; c) a completely ironic meme, meant to mock similar images; d) an image macro with impact text at the top and bottom, a common template with contextual meaning; e) just a tweet with an image, still considered a meme; f) text only, with no image composition or meaning beyond the text.
  • Figure 2: Visualization of the SimCLIP architecture. The input image is first processed by an OCR algorithm to extract the text. Image and text are CLIP-encoded and projected using a simple Siamese feed-forward network. The projections are concatenated ($\oplus$) with their absolute difference and Hadamard product, and the result is processed by a feed-forward classification head. In multitask settings, losses are summed across heads. In multilabel settings we use the binary cross-entropy loss, while in multiclass settings we use the categorical cross-entropy loss.
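
The pipeline in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is at the repository linked in the abstract): random vectors stand in for the frozen CLIP embeddings, the Siamese projection is assumed to share weights across modalities and use a ReLU, each classification head is reduced to a single linear layer, and all dimensions, task names, and labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8   # stand-in for the CLIP embedding size (assumption)
PROJ_DIM = 4    # hypothetical projection size
N_CLASSES = 3   # hypothetical multiclass task
N_LABELS = 2    # hypothetical multilabel task


def project(x, weight, bias):
    """Siamese projection: the same weights are applied to both modalities."""
    return np.maximum(x @ weight + bias, 0.0)  # linear layer + ReLU (assumed activation)


def siamese_fusion(p_img, p_txt):
    """Concatenate both projections with their absolute difference and Hadamard product."""
    return np.concatenate([p_img, p_txt, np.abs(p_img - p_txt), p_img * p_txt], axis=-1)


def categorical_cross_entropy(logits, labels):
    """Multiclass loss: log-softmax over classes; labels are integer class ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def binary_cross_entropy(logits, targets):
    """Multilabel loss: independent sigmoid per label; targets are 0/1 arrays."""
    # Numerically stable form of -[t*log(s) + (1-t)*log(1-s)], with s = sigmoid(logits).
    return (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits)))).mean()


# Random vectors standing in for frozen CLIP image/text embeddings (batch of 2).
img_emb = rng.normal(size=(2, EMBED_DIM))
txt_emb = rng.normal(size=(2, EMBED_DIM))

# One shared projection for both modalities.
w_proj = rng.normal(size=(EMBED_DIM, PROJ_DIM))
b_proj = np.zeros(PROJ_DIM)
fused = siamese_fusion(project(img_emb, w_proj, b_proj),
                       project(txt_emb, w_proj, b_proj))

# One head per task (a single linear layer here, as a simplification).
w_multiclass = rng.normal(size=(4 * PROJ_DIM, N_CLASSES))
w_multilabel = rng.normal(size=(4 * PROJ_DIM, N_LABELS))

# Multitask setting: the per-head losses are summed.
total_loss = (
    categorical_cross_entropy(fused @ w_multiclass, np.array([0, 2]))
    + binary_cross_entropy(fused @ w_multilabel, np.array([[1.0, 0.0], [0.0, 1.0]]))
)

print(fused.shape)  # (2, 16): four blocks of PROJ_DIM features each
```

In this sketch only the projection and head weights would be trained; the CLIP encoders that produce `img_emb` and `txt_emb` stay frozen, which is what keeps the model compact.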