Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?
Ben Yao, Yazhou Zhang, Qiuchi Li, Jing Qin
TL;DR
This work asks whether sarcasm detection in LLMs relies on step-by-step reasoning and introduces SarcasmCue, a prompting framework with four sub-methods (CoC, GoC, BoC, ToC) that draw on sequential and non-sequential cues. Across four sarcasm benchmarks and multiple LLMs, CoC and GoC perform best with larger models, while ToC delivers the largest gains for smaller models; overall, the framework improves F1 over the prior state of the art by 4.2%, 2.0%, 29.7%, and 58.2% on the four datasets. The framework uses a chain of contradiction, graph-based cue selection, an ensemble over cues, and tensor fusion to model high-order cue interactions, and it is robust across zero-shot and few-shot settings. It also extends to humor detection, suggesting that cue-based prompting applies more broadly to affective language understanding in NLP.
Abstract
Elaborating a series of intermediate reasoning steps significantly improves the ability of large language models (LLMs) to solve complex problems, as such steps prompt LLMs to think sequentially. However, human sarcasm understanding is often considered an intuitive and holistic cognitive process, in which various linguistic, contextual, and emotional cues are integrated to form a comprehensive understanding, in a way that does not necessarily follow a step-by-step fashion. To test this argument, we introduce a new prompting framework (called SarcasmCue) containing four sub-methods, viz. chain of contradiction (CoC), graph of cues (GoC), bagging of cues (BoC) and tensor of cues (ToC), which prompts LLMs to detect human sarcasm using both sequential and non-sequential prompting methods. Through a comprehensive empirical comparison on four benchmarks, we highlight three key findings: (1) CoC and GoC show superior performance with more advanced models like GPT-4 and Claude 3.5, with an improvement of 3.5%. (2) ToC significantly outperforms other methods when smaller LLMs are evaluated, boosting the F1 score by 29.7% over the best baseline. (3) Our proposed framework consistently improves on the state of the art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores across four datasets. This demonstrates the effectiveness and stability of the proposed framework.
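To make the non-sequential idea concrete, here is a minimal sketch of how a bagging-style method such as BoC might aggregate predictions. The abstract does not give BoC's procedure, so everything here is an assumption for illustration: the sketch assumes that random subsets of a cue pool are each passed to a classifier and the final label is chosen by majority vote. The names `bagging_of_cues` and `toy_classifier`, the cue names, and all parameters are hypothetical; `toy_classifier` stands in for an LLM call.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def bagging_of_cues(
    text: str,
    cue_pool: Sequence[str],
    classify: Callable[[str, Sequence[str]], str],
    n_rounds: int = 5,
    subset_size: int = 2,
    seed: int = 0,
) -> str:
    """Majority vote over predictions made from random cue subsets.

    Assumption: this mirrors generic bagging, not necessarily the
    exact BoC procedure used in the paper.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    votes: List[str] = []
    for _ in range(n_rounds):
        # Sample a cue subset without replacement for this round.
        k = min(subset_size, len(cue_pool))
        subset = rng.sample(list(cue_pool), k=k)
        votes.append(classify(text, subset))
    # Return the label that received the most votes.
    return Counter(votes).most_common(1)[0][0]


def toy_classifier(text: str, cues: Sequence[str]) -> str:
    # Hypothetical stand-in for an LLM prompted with the sampled cues.
    return "sarcastic" if "contrast" in cues else "not sarcastic"
```

No round depends on the output of another, which is what distinguishes this kind of aggregation from a step-by-step chain.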
