Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

Yihua Wang, Qi Jia, Cong Xu, Feiyu Chen, Yuhan Liu, Haotian Zhang, Liang Jin, Lu Liu, Zhichun Wang

TL;DR

This work addresses shortcut learning in multimodal sarcasm detection by refining the benchmark (MUStARD++ to MUStARD++$^{R}$) to remove character cues, canned laughter, and explicit emotion labels. It introduces MCIB, a three-modality conditional information bottleneck framework that distills redundancy-free, complementary information across text, audio, and video. The approach pairs a primary modality $x_p$ with an auxiliary modality $x_a$ through a latent state $b$: it minimizes $I(x_p; b)$ to compress the primary modality and maximizes $I(b; y | x_a)$ to retain task-relevant, complementary information, using a variational ELBO-style objective. Empirical results on MUStARD++ and MUStARD++$^{R}$ show state-of-the-art performance without shortcuts, with robust generalization to real-world conditions and competitive performance on extended multimodal sentiment datasets, arguing for MCIB as a generalizable plug-in for multimodal fusion tasks.
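
Read as a single trade-off, the compression and retention terms named above can be combined into one Lagrangian-style objective. The following is only a sketch under assumptions: the weight $\beta > 0$ and the encoder notation $p(b \mid x_p)$ do not appear in this summary, and the paper's exact formulation may differ.

$$
\min_{p(b \mid x_p)} \; I(x_p; b) \;-\; \beta \, I(b; y \mid x_a)
$$

Minimizing the first term compresses the primary modality $x_p$ into $b$; maximizing the second keeps what $b$ says about the target $y$ beyond what the auxiliary modality $x_a$ already provides.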

Abstract

Multimodal sarcasm detection is a complex task that requires distinguishing subtle complementary signals across modalities while filtering out irrelevant information. Many advanced methods rely on learning shortcuts from datasets rather than extracting intended sarcasm-related features. However, our experiments show that shortcut learning impairs the model's generalization in real-world scenarios. Furthermore, we reveal the weaknesses of current modality fusion strategies for multimodal sarcasm detection through systematic experiments, highlighting the necessity of focusing on effective modality fusion for complex emotion recognition. To address these challenges, we construct MUStARD++$^{R}$ by removing shortcut signals from MUStARD++. Then, a Multimodal Conditional Information Bottleneck (MCIB) model is introduced to enable efficient multimodal fusion for sarcasm detection. Experimental results show that the MCIB achieves the best performance without relying on shortcut learning.

Paper Structure

This paper contains 25 sections, 37 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Multimodal sarcasm analysis often relies on sitcoms like Friends, which primarily feature character dialogues. The illustration presents the multimodal sarcasm detection task and highlights several shortcut learning issues.
  • Figure 2: The diagram illustrates the overall architecture of the MCIB model. The multimodal fusion component employs three parallel conditional information bottleneck structures to filter out irrelevant information and extract relevant information between each pair of modalities. For each pair of modalities, we first minimize the mutual information between the primary modality and the latent state to achieve filtering and compression through the information bottleneck. We then maximize the conditional mutual information among the auxiliary modality, latent state, and prediction target. Finally, the bidirectional optimization within CIB produces an intermediate representation $b$ that encapsulates the essential information required for our prediction target.
  • Figure 3: The diagram illustrates the optimization direction for multimodal fusion. The two circles represent correctly predicted samples by the Text modality alone and by the Text + Acoustic combination. The overlapping region shows samples correctly predicted by both T and T + A; region R represents samples correctly predicted by T but misclassified when A is added; and region C represents samples only correctly predicted with T + A, while T alone fails.
  • Figure 4: The optimization process of the multimodal conditional information bottleneck is shown. The left is the initial state, where the blue section represents the latent state $b$ generated from the primary modality, and the green cells denote the conditional mutual information between the auxiliary modality, latent state, and target. The right is the ideal state: $b$ contains all information relevant to the target $y$, free from redundancy, and integrates complementary information from the primary modality concerning the auxiliary modality.
  • Figure 5: By constructing three latent states $b_0$, $b_1$, and $b_2$, the model facilitates the transfer of pertinent information among the three modalities $x_0$, $x_1$, and $x_2$. The integrated representation is then used to predict $y$.
  • ...and 2 more figures
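
The regions described in the Figure 3 caption can be counted directly from per-sample correctness flags. The sketch below is illustrative only: the function name `fusion_regions` and the toy inputs are hypothetical and do not come from the paper. It assumes two equal-length sequences of booleans, one for the Text-only model (T) and one for the Text + Acoustic model (T + A).

```python
def fusion_regions(correct_t, correct_ta):
    """Count the Figure 3 regions from per-sample correctness flags.

    correct_t[i]  -- True if the Text-only model (T) classifies sample i correctly.
    correct_ta[i] -- True if the Text + Acoustic model (T + A) does.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    if len(correct_t) != len(correct_ta):
        raise ValueError("both inputs must cover the same samples")
    pairs = list(zip(correct_t, correct_ta))
    return {
        # Correct under both T and T + A.
        "overlap": sum(1 for t, ta in pairs if t and ta),
        # Region R: correct with T alone, misclassified once A is added.
        "R": sum(1 for t, ta in pairs if t and not ta),
        # Region C: correct only with T + A, while T alone fails.
        "C": sum(1 for t, ta in pairs if not t and ta),
    }


# Toy data (made up): five samples.
correct_t = [True, True, False, False, True]
correct_ta = [True, False, True, False, True]
regions = fusion_regions(correct_t, correct_ta)
print(regions)
```

One plausible reading of the caption is that a good fusion strategy shrinks R while growing C, so tracking these two counts alongside overall accuracy shows whether adding a modality helps or hurts individual samples.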