On Sarcasm Detection with OpenAI GPT-based Models
Montgomery Gole, Williams-Paul Nwadiugwu, Andriy Miranskyy
TL;DR
This study evaluates fourteen GPT-based models for sarcasm detection, in both fine-tuned and zero-shot settings, on the pol-bal subset of SARC 2.0. Large fine-tuned GPT-3 models (notably davinci) reach state-of-the-art accuracy and $F_1$-scores of about $0.81$, whereas zero-shot performance peaks with GPT-4 variants at $Acc \approx 0.70$ and $F_1 \approx 0.75$ and varies across releases. Because performance can improve or deteriorate from one release to the next, models should be reevaluated after each update. The methodology combines prompt design, selective fine-tuning, zero-shot testing with logit bias, and McNemar-based significance testing, and offers practical guidance for deploying LLM-based sarcasm detectors. Overall, fine-tuned GPT-3 can surpass prior approaches on pol-bal, while zero-shot sarcasm detection remains challenging and release-dependent, which has implications for cost-effective deployment and for ongoing benchmarking of models on NLP tasks that require nuanced social understanding.
Abstract
Sarcasm is a form of irony that requires readers or listeners to interpret its intended meaning by considering context and social cues. Machine learning classification models have long had difficulty detecting sarcasm due to its social complexity and contradictory nature. This paper explores the application of Generative Pretrained Transformer (GPT) models, including GPT-3, InstructGPT, GPT-3.5, and GPT-4, to detecting sarcasm in natural language. It tests fine-tuned and zero-shot models of different sizes and releases. The models were tested on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC 2.0) sarcasm dataset. In the fine-tuning case, the largest fine-tuned GPT-3 model achieves an accuracy and $F_1$-score of 0.81, outperforming prior models. In the zero-shot case, one of the GPT-4 models yields an accuracy of 0.70 and an $F_1$-score of 0.75. Other models score lower. Additionally, a model's performance may improve or deteriorate with each release, highlighting the need to reassess performance after each release.
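As an illustration of the reported metrics, the sketch below computes accuracy and $F_1$ for a binary sarcasm classifier and compares two classifiers with an exact McNemar test. It is a minimal sketch, not the authors' evaluation code: the function names, the label encoding (1 = sarcastic, 0 = not sarcastic), the choice of the exact (binomial) variant of the test, and the toy data are all assumptions made for this example.

```python
from math import comb


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the sarcastic class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same test items.

    Only discordant items matter: those where exactly one classifier is correct.
    Under the null hypothesis, each discordant item favors either classifier
    with probability 0.5.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)  # A right, B wrong
    c = sum(1 for t, a, m in triples if a != t and m == t)  # A wrong, B right
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical, not taken from SARC 2.0).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
pred_a = [1, 1, 0, 0, 0, 1, 1, 0]
pred_b = [1, 0, 0, 0, 1, 1, 1, 0]

acc_a, f1_a = accuracy_and_f1(y_true, pred_a)
acc_b, f1_b = accuracy_and_f1(y_true, pred_b)
p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A: acc={acc_a:.2f} F1={f1_a:.2f} | B: acc={acc_b:.2f} | McNemar p={p_value:.2f}")
```

A paired test of this kind fits the setting because every model is scored on the same test items, so the comparison depends only on the items where the two models disagree in correctness.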
