Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?

Ben Yao; Yazhou Zhang; Qiuchi Li; Jing Qin

TL;DR

This work examines whether sarcasm detection in LLMs relies on step-by-step reasoning and introduces SarcasmCue, a prompting framework with four sub-methods (CoC, GoC, BoC, ToC) that leverage both sequential and non-sequential cues. Across four sarcasm benchmarks and multiple LLMs, CoC and GoC excel with larger models, while ToC delivers the largest gains for smaller models. Overall, the framework improves F1 over the previous state of the art (ToT) by 4.2%, 2.0%, 29.7%, and 58.2% on the four datasets. The sub-methods respectively use chains of contradiction, graph-based cue selection, ensemble cueing, and tensor fusion to model high-order cue interactions, and the framework remains robust in both zero-shot and few-shot settings. It also extends to humor detection, suggesting that cue-based prompting applies broadly to affective language understanding in NLP.
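The TL;DR attributes ToC's gains to tensor fusion of cue types, but this section does not specify how the fusion is computed, and in the paper it may be carried out over textual cues inside the prompt rather than over numbers. As a hedged illustration of what "high-order cue interactions" means, the sketch below borrows the outer-product fusion known from Tensor Fusion Networks: each cue view is represented as a numeric vector, a constant 1 is appended to each, and the vectors are combined by an outer product. The function names `augment`, `outer`, and `tensor_fusion` and the toy cue vectors are hypothetical.

```python
from functools import reduce
from typing import List, Sequence


def augment(vec: Sequence[float]) -> List[float]:
    """Append a constant 1 so lower-order terms survive the outer product."""
    return list(vec) + [1.0]


def outer(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Row-major flattened outer product of two vectors."""
    return [x * y for x in a for y in b]


def tensor_fusion(cue_views: List[Sequence[float]]) -> List[float]:
    """Fuse several cue-view vectors into one flattened interaction tensor.

    Hypothetical illustration of tensor-product fusion, not the paper's ToC.
    """
    if not cue_views:
        raise ValueError("need at least one cue view")
    return reduce(outer, [augment(v) for v in cue_views])


# Toy vectors for three cue views (values are made up for illustration).
linguistic = [0.2, 0.8]
contextual = [0.5]
emotional = [0.9, 0.1]

fused = tensor_fusion([linguistic, contextual, emotional])
print(len(fused))  # 18 = (2+1) * (1+1) * (2+1)
```

Because of the appended 1, the fused vector contains every single-cue term, every pairwise product, and every three-way product; the price is that its length grows multiplicatively with the number of cue views.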

Abstract

Elaborating a series of intermediate reasoning steps significantly improves the ability of large language models (LLMs) to solve complex problems, because such steps encourage LLMs to think sequentially. However, human sarcasm understanding is often considered an intuitive and holistic cognitive process, in which various linguistic, contextual, and emotional cues are integrated into a comprehensive understanding that does not necessarily proceed step by step. To test this argument, we introduce a new prompting framework, SarcasmCue, containing four sub-methods, viz. chain of contradiction (CoC), graph of cues (GoC), bagging of cues (BoC), and tensor of cues (ToC), which prompt LLMs to detect human sarcasm through both sequential and non-sequential prompting. Through a comprehensive empirical comparison on four benchmarks, we highlight three key findings: (1) CoC and GoC perform best with more advanced models such as GPT-4 and Claude 3.5, yielding an improvement of 3.5%. (2) ToC significantly outperforms the other methods when smaller LLMs are evaluated, boosting the F1 score by 29.7% over the best baseline. (3) The proposed framework consistently outperforms the state of the art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 score across the four datasets. These results demonstrate the effectiveness and stability of the proposed framework.
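The abstract lists bagging of cues (BoC) among the sub-methods that do not follow a step-by-step chain, but its mechanics are not described in this section. A minimal sketch, assuming BoC behaves like classic bagging: draw several random subsets of candidate cues, obtain an independent label for each subset, and take a majority vote. The names `bagging_of_cues` and `toy_classifier` and the parameters `n_bags`, `bag_size`, and `seed` are hypothetical; `toy_classifier` is a keyword rule standing in for an LLM call.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def bagging_of_cues(
    cues: Sequence[str],
    classify: Callable[[List[str]], str],
    n_bags: int = 5,
    bag_size: int = 3,
    seed: int = 0,
) -> str:
    """Label several random cue subsets independently, then majority-vote.

    Hypothetical sketch of a bagging-style BoC; no bag depends on another.
    """
    if not cues:
        raise ValueError("need at least one cue")
    rng = random.Random(seed)
    k = min(bag_size, len(cues))
    votes = [classify(rng.sample(list(cues), k)) for _ in range(n_bags)]
    # Counter.most_common orders equal counts by first occurrence.
    label, _count = Counter(votes).most_common(1)[0]
    return label


def toy_classifier(bag: List[str]) -> str:
    """Hypothetical stand-in for an LLM call that labels one bag of cues."""
    if any("contradiction" in cue for cue in bag):
        return "sarcastic"
    return "not sarcastic"


cues = [
    "contradiction: praise for a delayed train",
    "exaggerated wording",
    "eye-roll emoji",
    "literal weather report",
]
print(bagging_of_cues(cues, toy_classifier))
```

Because each bag is labeled independently, the calls could run in parallel and their order does not matter, which is what separates this non-sequential strategy from a step-by-step chain such as CoC.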

Paper Structure (19 sections, 1 equation, 5 figures, 4 tables)

Introduction
Related Work
Chain-of-Thought Prompting
Sarcasm Detection
The Proposed Framework: SarcasmCue
Task Definition
Chain of Contradiction
Graph of Cues
Bagging of Cues
Tensor of Cues
Experiments
Experiment Setups
Main Results
Ablation Study
Zero-shot v/s Few-shot Prompting
...and 4 more sections

Figures (5)

Figure 1: Comparison of the processes of mathematical reasoning and sarcasm detection.
Figure 2: An illustration of our SarcasmCue framework that consists of four prompting sub-methods.
Figure 3: The average Macro-F1 across K-shots for the GPT-4o and Claude 3.5 Sonnet models.
Figure 4: The influence of model scale. The top and bottom panels correspond to the Qwen and Llama models, respectively.
Figure 5: The average error rate of the four prompting methods.
