Assessing how hyperparameters impact Large Language Models' sarcasm detection performance

Montgomery Gole, Andriy Miranskyy

TL;DR

This study investigates how hyperparameters, model size, and release version influence sarcasm detection in OpenAI GPT and Meta Llama-2 models, using the pol-bal subset of the SARC2.0 dataset. It contrasts fine-tuned and zero-shot settings, finding that fine-tuned performance scales with model size within a model family and that full-precision Llama-2-13b reaches state-of-the-art accuracy and $F_1$-score (both 0.83), while the best GPT-4 zero-shot model is competitive with prior attempts (accuracy 0.70, $F_1$-score 0.75). The authors run hyperparameter sweeps for Llama-2 and use statistical analyses, including McNemar tests and regression analyses, to identify which factors most affect performance. The work stresses the need to re-evaluate model performance after each release, discusses interpretability insights via SHAP/IG, and outlines practical steps for applying LLMs to sarcasm detection in real-world contexts, along with avenues for future research. Overall, the paper provides a methodological framework for evaluating how model characteristics and hyperparameters shape sarcasm detection in large language models, and it demonstrates strong performance from fine-tuned Llama-2 variants on a challenging social-media dataset.

Abstract

Sarcasm detection is challenging for both humans and machines. This work explores how model characteristics impact sarcasm detection in OpenAI's GPT and Meta's Llama-2 models, chosen for their strong natural language understanding and popularity. We evaluate fine-tuned and zero-shot models across various sizes, releases, and hyperparameters. Experiments were conducted on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC2.0) sarcasm dataset. Fine-tuned performance improves monotonically with model size within a model family, and hyperparameter tuning also affects performance. In the fine-tuning scenario, full-precision Llama-2-13b achieves state-of-the-art accuracy and $F_1$-score, both measured at 0.83, comparable to average human performance. In the zero-shot setting, one GPT-4 model achieves performance competitive with prior attempts, yielding an accuracy of 0.70 and an $F_1$-score of 0.75. Furthermore, a model's performance may increase or decline with each release, highlighting the need to reassess performance after each release.
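The two metrics quoted in the abstract can be computed from paired label lists as in the following minimal sketch. The function name `accuracy_and_f1` and the toy labels are hypothetical, and treating "sarcastic" as the positive class (label 1) is an assumption; the abstract does not say which class the $F_1$-score is computed for.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels.

    `positive` marks the class F1 is computed for; using the sarcastic
    class here is an assumption, not something the abstract states.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy balanced sample: 4 sarcastic (1) and 4 non-sarcastic (0) comments,
# with a classifier that over-predicts the sarcastic class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

On this toy sample the $F_1$-score exceeds accuracy because every sarcastic comment is caught while some non-sarcastic ones are mislabeled. That is one way an $F_1$-score can sit above accuracy on a balanced set; it is an illustration only, not a claim about how the reported GPT-4 numbers arose.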

Paper Structure

This paper contains 52 sections, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Prompts used for fine-tuning and zero-shot testing models. <reply> and <thread> placeholders represent template values that need to be replaced with actual content. (word1|word2) denotes that word1 was used in the prompt for GPT experiments, while word2 was used in the prompt for Llama-2 experiments.
  • Figure 2: A GPT zero-shot test prompt with two comments in a thread.
  • Figure 3: A Llama-2 zero-shot test prompt with two comments in a thread.
  • Figure 4: Pairwise McNemar's test of fine-tuned base GPT-3 models.
  • Figure 5: Fine-tuning results, comparing accuracy to parameter count. The top-performing Llama-2-7b and Llama-2-13b models are shown, along with each fine-tuned GPT-3 model.
  • ...and 16 more figures
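
The pairwise McNemar's test named in the Figure 4 caption compares two classifiers on the same test items by looking only at the items where exactly one of them is correct. The sketch below uses the exact (binomial) form of the test; whether the paper uses this form or the chi-squared approximation is not stated here, so that choice is an assumption, as are the function name `mcnemar_exact` and the toy predictions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A gets right
    and c counts items only model B gets right.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, pa, pb in triples if pa == t and pb != t)
    c = sum(1 for t, pa, pb in triples if pa != t and pb == t)
    n = b + c
    if n == 0:
        # The models never disagree in correctness: no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy example: model A is right on all 10 items, model B on only 2.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2
b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(f"b={b}, c={c}, p={p:.4f}")
```

Items that both models classify correctly, or both misclassify, do not enter the statistic, which is why the test suits comparisons of models evaluated on an identical test split.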