Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue

Kun Ouyang, Liqiang Jing, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie

TL;DR

EDGE tackles sentiment-enhanced sarcasm explanation in multimodal dialogue by integrating utterance and video-audio sentiments into a context-sentiment graph fed to a BART-based generator. It introduces lexicon-guided utterance sentiment refinement, a Joint Cross Attention-based Sentiment Inference (JCA-SI) module for video-audio signals, and a two-tier graph encoding that captures context flow and sentiment relationships. Key contributions include the BabelSenticNet-driven sentiment refinement, the JCA-SI module, and the context-sentiment graph, whose weighted sentiment edges improve generation quality on the WITS dataset, as validated by automatic metrics and human evaluation. The work demonstrates that explicit sentiment modeling across modalities enhances sarcasm understanding and explanation, with practical implications for multimodal conversational systems and social-media analysis.

Abstract

Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for a given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studies have achieved great success based on the generative pretrained language model BART, they overlook the sentiments residing in the utterance, video, and audio, which play important roles in reflecting sarcasm, since sarcasm essentially involves subtle sentiment contrasts. Nevertheless, it is non-trivial to incorporate sentiments for boosting SED performance, due to three main challenges: 1) diverse effects of utterance tokens on sentiments; 2) the gap between video-audio sentiment signals and the embedding space of BART; and 3) various relations among utterances, utterance sentiments, and video-audio sentiments. To tackle these challenges, we propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where a heuristic utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip. Thereafter, we devise a context-sentiment graph to comprehensively model the semantic relations among the utterances, utterance sentiments, and video-audio sentiments, to facilitate sarcasm explanation generation. Extensive experiments on the publicly released dataset WITS verify the superiority of our model over cutting-edge methods.
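
To make the idea of a context-sentiment graph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization, one utterance sentiment and one video-audio sentiment per utterance, context edges between neighbouring tokens, and sentiment edges with a single hypothetical weight `sentiment_weight`. The function name `build_context_sentiment_graph`, the edge rules, and the weight value are all illustrative assumptions; the abstract does not specify them.

```python
from __future__ import annotations


def build_context_sentiment_graph(
    utterances: list[str],
    utt_sentiments: list[str],
    va_sentiments: list[str],
    sentiment_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> tuple[list[str], list[list[float]]]:
    """Return (nodes, adjacency matrix) for an illustrative context-sentiment graph."""
    if not (len(utterances) == len(utt_sentiments) == len(va_sentiments)):
        raise ValueError("expected one sentiment of each kind per utterance")

    # Token nodes come first, in dialogue order; remember each utterance's span.
    nodes: list[str] = []
    spans: list[tuple[int, int]] = []
    for utterance in utterances:
        tokens = utterance.split()  # assumption: whitespace tokenization
        spans.append((len(nodes), len(nodes) + len(tokens)))
        nodes.extend(tokens)
    num_tokens = len(nodes)

    # Sentiment nodes follow: utterance sentiments, then video-audio sentiments.
    nodes.extend(utt_sentiments)
    nodes.extend(va_sentiments)

    size = len(nodes)
    adj = [[0.0] * size for _ in range(size)]

    def connect(i: int, j: int, weight: float) -> None:
        # Undirected edge: write both directions.
        adj[i][j] = weight
        adj[j][i] = weight

    # Self-loops keep each node's own features during aggregation.
    for i in range(size):
        adj[i][i] = 1.0

    # Context edges: neighbouring tokens, including across utterance boundaries.
    for i in range(num_tokens - 1):
        connect(i, i + 1, 1.0)

    # Sentiment edges: each sentiment node links to the tokens of its utterance,
    # and the two sentiment nodes of the same utterance link to each other.
    num_utts = len(utterances)
    for u, (start, end) in enumerate(spans):
        utt_node = num_tokens + u
        va_node = num_tokens + num_utts + u
        for t in range(start, end):
            connect(utt_node, t, sentiment_weight)
            connect(va_node, t, sentiment_weight)
        connect(utt_node, va_node, sentiment_weight)

    return nodes, adj


if __name__ == "__main__":
    nodes, adj = build_context_sentiment_graph(
        ["oh great", "another meeting"],
        ["positive", "neutral"],
        ["negative", "neutral"],
    )
    for name, row in zip(nodes, adj):
        print(f"{name:>10}", row)
```

In a full model, an adjacency matrix of this kind would typically be consumed by graph convolution layers operating on token and sentiment embeddings; that part is omitted here because the abstract gives no details on it.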

Paper Structure

This paper contains 17 sections, 16 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: A sample of sarcasm explanation in dialogue from the WITS dataset (Kumar et al., 2022) and the corresponding sentiments.
  • Figure 2: Illustration of the proposed EDGE, which contains four components.
  • Figure 3: The utterance sentiment inference process for three example utterances, comparing the refined sentiments with the original sentiments.
  • Figure 4: An example of a context-sentiment graph, constructed for a dialogue of three utterances. Tokens in red are the utterance sentiments and those in blue are video-audio sentiments. $n_j$ denotes the $j$-th node in the context-sentiment graph.
  • Figure 5: The training curve of EDGE over $60$ epochs.
  • ...and 2 more figures