Exploring Vision Language Models for Multimodal and Multilingual Stance Detection

Jake Vasilakes, Carolina Scarton, Zhixue Zhao

TL;DR

This work addresses the gap in stance detection for multimodal and multilingual social media by evaluating four Vision-Language Models (VLMs) on an extended multilingual multimodal dataset spanning seven languages. Using zero-shot evaluation and targeted ablations, the study reveals that VLMs rely predominantly on text, including text embedded in images, to predict stance, with vision contributing only modest gains overall. It also analyzes cross-language behavior, finding general cross-language consistency with notable outliers, and shows that model choice matters for multilingual robustness, with Ovis showing the strongest cross-language stability. The findings highlight the practical potential of VLMs for cross-lingual multimodal stance tasks while underscoring the need to better capture in-image text and to broaden language coverage and dataset resources.

Abstract

Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than images for stance detection and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within the images than other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether they are explicitly multilingual or not, although there are outliers that are incongruous with macro F1, language support, and model size.

Paper Structure

This paper contains 31 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The instruction prompt templates for the Tweet & Image (top) and Image Text (bottom) experiments. <|image|> is the special token whose embedding is set according to the vision model, {target} is the stance target for the given example, such as "Joe Biden" or "Merger and acquisition between Aetna and Humana", {tweet} is the tweet text, and {image_text} is the plain text extracted from the image by the OCR tool.
  • Figure 2: Left: An image which contains text that would be useful for prediction of the stance regarding Donald Trump. Middle: The same image after covering up the text using the bounding box output of the OCR tool, as used in the Text Blackout experiments. Right: Likewise for the Content Blackout experiments. Bottom: The plain text extracted by the OCR tool as used in the Image Text experiments.
  • Figure 3: Performance of each VLM on the English dataset for each evaluation scenario. Statistical significance vs. Tweet & Image is indicated as * $p \leq 0.05$, ** $p \leq 0.005$.
  • Figure 4: Macro F1 of each VLM on each language in each evaluation scenario. Statistical significance vs. the corresponding English results is computed using McNemar's test and is indicated as * $p \leq 0.05$, ** $p \leq 0.005$.
  • Figure 5: Cohen's kappa values between predictions for each pair of languages in the Text & Image scenario. The Text/Image Only scenarios exhibit the same trends and are omitted for space reasons.
  • ...and 2 more figures
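
The captions for Figures 4 and 5 refer to two statistics: McNemar's test for comparing a model's results on a language against its English results, and Cohen's kappa for agreement between predictions on pairs of languages. The sketch below illustrates what these quantities measure, using only the Python standard library. It is not the authors' code: the function names, the example labels, and the choice of the exact (binomial) form of McNemar's test are assumptions made for illustration, since the captions do not say which variant was used.

```python
from collections import Counter
from math import comb


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of predicted labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences consist of one identical label
    return (observed - expected) / (1 - expected)


def mcnemar_exact(gold, preds_a, preds_b):
    """Two-sided exact McNemar p-value (assumed variant) for paired predictions."""
    # Discordant pairs: exactly one of the two systems is correct.
    only_a = sum(pa == g and pb != g for g, pa, pb in zip(gold, preds_a, preds_b))
    only_b = sum(pa != g and pb == g for g, pa, pb in zip(gold, preds_a, preds_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial tail under the null hypothesis that either outcome is equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical predictions for the same four posts in two languages.
english = ["favor", "against", "favor", "against"]
german = ["favor", "against", "against", "against"]
print(cohen_kappa(english, german))
```

A kappa near 1 means the model assigns the same stance to a post regardless of the language it is presented in, while a small McNemar p-value means the accuracy difference between two paired runs is unlikely to be due to chance.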