Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model

Fuqiang Niu, Zebang Cheng, Xianghua Fu, Xiaojiang Peng, Genan Dai, Yin Chen, Hu Huang, Bowen Zhang

TL;DR

The paper tackles multimodal stance detection in real-world, multi-turn social media conversations by introducing the MmMtCSD benchmark (21,340 examples across Tesla, Bitcoin, and Post-T targets) and a multimodal large language model framework called MLLM-SD. MLLM-SD fuses textual history and visual content via a textual encoder, a ViT-based visual encoder, and a LoRA-tuned LLaMA2-chat backend, using image captions from GPT4-Vision and carefully designed multimodal prompts. Experiments show state-of-the-art performance on MmMtCSD for both in-target and cross-target settings, with ablations underscoring the importance of image captions and chain-of-thought prompting. The work provides a new, challenging benchmark and demonstrates the practical impact of conversational context and multimodal reasoning for stance detection, with potential applications in web mining and content analysis.
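As a rough illustration of how conversation history and image captions might be combined into one text prompt, here is a minimal sketch. It is not the authors' prompt template: the `Turn` class, the `build_stance_prompt` function, the field names, and the prompt wording are all hypothetical, and captions are assumed to be already available as plain strings. Only the three stance labels come from the paper (see Figure 1).

```python
from dataclasses import dataclass
from typing import List, Optional

# Stance labels as named in the paper's Figure 1 caption.
STANCE_LABELS = ("favor", "against", "none")


@dataclass
class Turn:
    """One conversation turn (hypothetical structure, for illustration only)."""
    speaker: str
    text: str
    image_caption: Optional[str] = None  # caption of an attached image, if any


def build_stance_prompt(target: str, turns: List[Turn]) -> str:
    """Assemble a text prompt asking for the stance of the last turn."""
    if not turns:
        raise ValueError("conversation must contain at least one turn")
    lines = [f"Target: {target}", "Conversation:"]
    for i, turn in enumerate(turns, start=1):
        line = f"[{i}] {turn.speaker}: {turn.text}"
        if turn.image_caption:
            # Inline the caption so a text-only backend can still see the image content.
            line += f" (image: {turn.image_caption})"
        lines.append(line)
    lines.append(
        f"Question: what is the stance of turn [{len(turns)}] towards the target? "
        f"Answer with one of: {', '.join(STANCE_LABELS)}."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    conversation = [
        Turn("A", "Just got my Model 3!", "a red car parked in a driveway"),
        Turn("B", "Overpriced."),
    ]
    print(build_stance_prompt("Tesla", conversation))
```

In this sketch the prompt is pure text; a framework such as the one summarized above would additionally feed visual features from the image encoder to the language model, which is not shown here.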

Abstract

Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content, including text and images, multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD) that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.

Paper Structure (24 sections, 4 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: An example of multimodal multi-turn conversational stance detection, with symbols denoting the "favor" (check mark), "against" (cross), and "none" (horizontal line) stances.
  • Figure 2: The architecture of our MLLM-SD framework.
  • Figure 3: Comparison of single-sentence and conversation results. The textual display shows the experimental results for LLaMA2-70b, and the multimodal display shows the results for MLLM-SD.