
On Sarcasm Detection with OpenAI GPT-based Models

Montgomery Gole, Williams-Paul Nwadiugwu, Andriy Miranskyy

TL;DR

This study addresses sarcasm detection by evaluating fourteen GPT-based models across fine-tuned and zero-shot settings on the pol-bal subset of SARC 2.0. It demonstrates that large fine-tuned GPT-3 models (notably davinci) attain state-of-the-art accuracy and $F_1$-scores around $0.81$, while zero-shot performance peaks with GPT-4 variants at an accuracy of about $0.70$ and $F_1 \approx 0.75$, albeit with inconsistencies across releases. The work highlights that model performance can improve or deteriorate with new releases, emphasizing the need for reevaluation after each update. It also outlines a careful methodology combining prompt design, selective fine-tuning, zero-shot testing with logit bias, and rigorous McNemar-based significance testing, providing practical guidance for deploying LLM-based sarcasm detectors in real-world contexts. Overall, the findings suggest that while fine-tuned GPT-3 can surpass prior approaches on pol-bal, zero-shot sarcasm detection remains challenging and highly release-dependent, with implications for cost-effective deployment and ongoing model benchmarking in NLP tasks requiring nuanced social understanding.

Abstract

Sarcasm is a form of irony that requires readers or listeners to interpret its intended meaning by considering context and social cues. Machine learning classification models have long had difficulty detecting sarcasm due to its social complexity and contradictory nature. This paper explores the application of Generative Pretrained Transformer (GPT) models, including GPT-3, InstructGPT, GPT-3.5, and GPT-4, to detecting sarcasm in natural language. It tests fine-tuned and zero-shot models of different sizes and releases. The GPT models were tested on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC 2.0) sarcasm dataset. In the fine-tuning case, the largest fine-tuned GPT-3 model achieves an accuracy and $F_1$-score of 0.81, outperforming prior models. In the zero-shot case, one of the GPT-4 models yields an accuracy of 0.70 and an $F_1$-score of 0.75. Other models score lower. Additionally, a model's performance may improve or deteriorate with each release, highlighting the need to reassess performance after each release.

Paper Structure (30 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Prompts used for fine-tuning and zero-shot GPT models. <reply> and <thread> placeholders represent template values that need to be replaced with actual content.
  • Figure 2: A zero-shot test prompt with two comments in a thread.
  • Figure 3: P-values of the pairwise McNemar's test of fine-tuned GPT-3 and GPT-3.5-turbo models; $p$-values $\ge 0.05$ are highlighted in red.
  • Figure 4: P-values of the pairwise McNemar's test for zero-shot experiments with bias using GPT-3.5-turbo and GPT-4 ChatGPT models; $p$-values $\ge 0.05$ are highlighted in red.
  • Figure 5: P-values of the pairwise McNemar's test for zero-shot experiments without bias using GPT-3.5-turbo and GPT-4 ChatGPT models; $p$-values $\ge 0.05$ are highlighted in red.
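
Figures 3 to 5 report pairwise McNemar's tests, which compare two classifiers evaluated on the same test items by looking only at the items where exactly one of them is correct. The sketch below is a minimal illustration of that idea, not the authors' code: it assumes the exact (binomial) two-sided variant of the test, and the function name mcnemar_exact and the toy labels are hypothetical. The source does not say which variant of the test was used.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar's test for two classifiers on the same items.

    Returns (b, c, p_value), where b counts items that only model A got
    right and c counts items that only model B got right.
    Assumption: exact binomial variant; the paper may use another variant.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, p in triples if a == t and p != t)
    c = sum(1 for t, a, p in triples if a != t and p == t)
    n = b + c
    if n == 0:
        # The two models never disagree in correctness: no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, the discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 1 = sarcastic, 0 = not sarcastic.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 1, 0, 0, 1, 0, 0, 1]
pred_b = [0, 0, 1, 1, 1, 0, 1, 1]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)  # 4 1 0.375
```

With the threshold used in the figure captions, a $p$-value $\ge 0.05$ such as the one in this toy example would mean the difference between the two models is not statistically significant.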