On VLMs for Diverse Tasks in Multimodal Meme Classification

Deepesh Gavit, Debajyoti Mazumder, Samiran Das, Jasabanta Patro

TL;DR

This work systematically evaluates vision-language models (VLMs) for meme classification across multiple tasks on the Memotion and MAMI datasets. It compares prompting strategies, LoRA-based fine-tuning, and a novel CoVExFiL pipeline that trains LLMs on VLM-generated explanations, finding that three-step chain-of-thought prompting offers strong gains while LoRA fine-tuning underperforms. CoVExFiL, the centerpiece, yields substantial improvements by distilling VLM reasoning into downstream LLMs, surpassing the baselines and several SOTA references on several tasks. The study highlights both the potential and the limitations of current VLMs in capturing nuanced meme content and suggests directions for richer contextualization and multilingual capabilities.

Abstract

In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes LLMs on a textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) proposing a novel approach in which detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. The strategy of combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52%, and 26.24% for sarcasm, offensive, and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.

Paper Structure

This paper contains 33 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Schematic diagram of the three strategies mentioned in the experiment section.
  • Figure 2: Data samples from the Memotion dataset. For each meme, the full set of subcategories corresponding to each classification task is listed. The ground-truth label for each task is highlighted in green.
  • Figure 3: Example from the MAMI dataset illustrating two tasks: Task A (binary: 1 = misogynistic, 0 = non-misogynistic) and Task B (multi-label: 1/0 for shaming, stereotype, objectification, violence).
  • Figure 4: Example Prompt Template for Experiment 1. Here, we have specified the prompts we used in ZS, ZSC, FS, and FSC.
  • Figure 5: Examples from the test set, shown with their corresponding gold labels to illustrate the VLM's understanding. Memes on which the task was performed well are marked in green, those on which it was performed moderately well or relatably are marked in blue, and those on which it was performed poorly are marked in red.
  • ...and 1 more figures