Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings

Guy Barel, Oren Tsur, Dan Vilenchik

TL;DR

This work tackles stance detection in multi-party conversations by proposing TASTE, a multimodal model that jointly leverages content from utterances and social structure from interaction graphs. Textual representations are drawn from Sentence-BERT, while structural embeddings are learned via a max-cut formulation solved with Semi-Definite Programming (SDP) relaxations, producing contextual speaker embeddings that are fused with content through a Gated Residual Network (GRN). Empirical results on 4Forums and CreateDebate show state-of-the-art performance, with analysis revealing that structure often provides a stronger signal than text, and that combining both modalities yields consistent gains (approximately 12% on average). The work highlights the importance of social grounding in stance detection and demonstrates a scalable framework that can be extended to diverse conversational settings and languages.

Abstract

Stance detection plays a pivotal role in enabling an extensive range of downstream applications, from discourse parsing to tracing the spread of fake news and the denial of scientific facts. While most stance classification models rely on textual representation of the utterance in question, prior work has demonstrated the importance of the conversational context in stance detection. In this work we introduce TASTE -- a multimodal architecture for stance detection that harmoniously fuses Transformer-based content embedding with unsupervised structural embedding. Through the fine-tuning of a pretrained transformer and the amalgamation with social embedding via a Gated Residual Network (GRN) layer, our model adeptly captures the complex interplay between content and conversational structure in determining stance. TASTE achieves state-of-the-art results on common benchmarks, significantly outperforming an array of strong baselines. Comparative evaluations underscore the benefits of social grounding -- emphasizing the criticality of concurrently harnessing both content and structure for enhanced stance detection.

Paper Structure

This paper contains 32 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of a discussion tree, its corresponding speaker interaction graph, and the speaker embeddings derived using the max-cut SDP. Node colors correspond to speakers. Tree nodes represent utterances; graph nodes represent speakers. Edge width corresponds to the number of interactions. Speaker embeddings lie on an $n$-dimensional sphere and are rounded to discrete values by projection onto a random hyperplane (dashed line).
  • Figure 2: Illustration of the TASTE architecture.
  • Figure 3: A short excerpt from a longer discussion (“Do you support Gun Control?”) in 4Forums. Pro/Con indicates the true label. ✓ or ✗ in a blue (gray) circle indicate whether TASTE (S-BERT) assigned the correct label to the utterance.
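
The pipeline described in the Figure 1 caption (speaker interaction graph, unit-sphere speaker embeddings, random-hyperplane rounding) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a low-rank, block-coordinate heuristic as a stand-in for a real SDP solver, and the function names (`speaker_embeddings`, `round_by_hyperplane`, `cut_weight`, `best_cut`) and the toy edge weights are hypothetical, chosen only for illustration.

```python
import math
import random


def speaker_embeddings(n, weights, dim=None, sweeps=200, seed=0):
    """Heuristic stand-in for the max-cut SDP relaxation (assumption:
    a low-rank block-coordinate method, not an exact SDP solver).

    weights maps an undirected edge (i, j) to an interaction count.
    Each speaker gets a unit vector; every update moves a speaker to
    point away from the weighted sum of its neighbours' vectors.
    """
    rng = random.Random(seed)
    dim = dim or n
    neighbors = {i: [] for i in range(n)}
    for (i, j), w in weights.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))

    # Random starting points on the unit sphere.
    vecs = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])

    for _ in range(sweeps):
        for i in range(n):
            pull = [0.0] * dim
            for j, w in neighbors[i]:
                for d in range(dim):
                    pull[d] += w * vecs[j][d]
            norm = math.sqrt(sum(x * x for x in pull))
            if norm > 0:  # isolated speakers keep their current vector
                vecs[i] = [-x / norm for x in pull]
    return vecs


def round_by_hyperplane(vecs, rng):
    """Assign each speaker to +1 or -1 by the side of a random hyperplane."""
    normal = [rng.gauss(0, 1) for _ in range(len(vecs[0]))]
    return [1 if sum(a * b for a, b in zip(v, normal)) >= 0 else -1
            for v in vecs]


def cut_weight(weights, sides):
    """Total weight of edges whose endpoints land on different sides."""
    return sum(w for (i, j), w in weights.items() if sides[i] != sides[j])


def best_cut(n, weights, trials=20, seed=0):
    """Round several times and keep the partition with the largest cut."""
    vecs = speaker_embeddings(n, weights, seed=seed)
    rng = random.Random(seed + 1)
    candidates = [round_by_hyperplane(vecs, rng) for _ in range(trials)]
    return max(candidates, key=lambda s: cut_weight(weights, s))


if __name__ == "__main__":
    # Hypothetical four-speaker graph; weights are interaction counts.
    toy = {(0, 1): 3, (1, 2): 1, (2, 3): 2, (0, 3): 1}
    sides = best_cut(4, toy)
    print(sides, cut_weight(toy, sides))
```

In this sketch the continuous unit vectors play the role of the structural speaker embeddings, and the rounded ±1 labels correspond to the discrete split shown by the dashed line in Figure 1. Speakers who interact heavily tend to land on opposite sides, which matches the intuition that replies in a debate are mostly directed at opponents.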