Multi-modal Stance Detection: New Datasets and Model

Bin Liang, Ang Li, Jingqian Zhao, Lin Gui, Min Yang, Yue Yu, Kam-Fai Wong, Ruifeng Xu

TL;DR

This work addresses multi-modal stance detection on Twitter by leveraging paired text and images. It introduces five datasets (Mtse, Mccq, Mwtwt, Mruc, and Mtwq) totaling 17,544 annotated samples across 12 targets, collected from Twitter and labeled by majority vote. The core method, Targeted Multi-modal Prompt Tuning (TMPT), feeds target-specific textual prompts to the text encoder and target-specific visual prompts to the visual encoder, then applies a simple fusion step and a softmax classifier. Experiments show that TMPT achieves state-of-the-art performance on in-target tasks and competitive results in zero-shot settings. Ablations confirm that both the textual and the visual prompts contribute, and further analyses examine dataset characteristics and failure modes.

Abstract

Stance detection is a challenging task that aims to identify public opinion from social media platforms with respect to specific targets. Previous work on stance detection largely focused on pure texts. In this paper, we study multi-modal stance detection for tweets consisting of texts and images, which are prevalent in today's fast-growing social media platforms where people often post multi-modal messages. To this end, we create five new multi-modal stance detection datasets of different domains based on Twitter, in which each example consists of a text and an image. In addition, we propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT), where target information is leveraged to learn multi-modal stance features from textual and visual modalities. Experimental results on our five benchmark datasets show that the proposed TMPT achieves state-of-the-art performance in multi-modal stance detection.
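
The last stage of the pipeline summarized above (target-aware text and image features, a simple fusion, then a softmax classifier) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the concatenation fusion, the three-way label set, the feature sizes, and the names `fuse_and_classify`, `weights`, and `bias` are all assumptions made for the example, and random vectors stand in for the outputs of the prompted encoders.

```python
import math
import random

# Assumed label set; the source only shows "Favor" and "Against" examples.
STANCES = ["favor", "against", "neutral"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(text_feat, image_feat, weights, bias):
    """Concatenate the two modality features, apply one linear layer, then softmax.

    Concatenation is an assumed stand-in for the "simple fusion" step.
    """
    fused = list(text_feat) + list(image_feat)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Placeholder features standing in for the prompted text and image encoders.
rng = random.Random(0)
text_feat = [rng.uniform(-1, 1) for _ in range(4)]
image_feat = [rng.uniform(-1, 1) for _ in range(4)]

# Hypothetical, untrained classifier parameters (one row per stance label).
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in STANCES]
bias = [0.0] * len(STANCES)

probs = fuse_and_classify(text_feat, image_feat, weights, bias)
predicted = STANCES[probs.index(max(probs))]
print(predicted, probs)
```

In a trained model the linear layer would be learned jointly with the prompts; here it is random, so the predicted label carries no meaning beyond showing the data flow.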

Paper Structure (49 sections, 10 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: An example of a user expressing an "Against" stance towards "Donald Trump" and a "Favor" stance towards "Joe Biden" using multi-modal information.
  • Figure 2: The overall architecture of our proposed TMPT. Textual Prompting is devised for adapting the large pre-trained language model. Visual Prompt Tuning is devised for adapting the large pre-trained vision model.
  • Figure 3: Performance of using different pre-trained language models: RoBERTa (Nguyen et al., 2020) (top) and KEBERT (Kawintiranon and Singh, 2021) (bottom). The reported results are the Macro F1-score across all targets in a dataset on in-target multi-modal stance detection.
  • Figure 4: Performance of using different multi-modal fusion methods. The reported results are the Macro F1-score across all targets in a dataset on in-target multi-modal stance detection.
  • Figure 5: Visualization of a typical example.
  • ...and 1 more figures