Assessing how hyperparameters impact Large Language Models' sarcasm detection performance
Montgomery Gole, Andriy Miranskyy
TL;DR
This study investigates how hyperparameters, model size, and versioning influence sarcasm detection in OpenAI GPT and Meta Llama-2 models using the pol-bal subset of the SARC2.0 dataset. It contrasts fine-tuned and zero-shot paradigms, finding that fine-tuned performance scales with model size within a model family. Fine-tuned Llama-2-13b reaches state-of-the-art accuracy and $F_1$-score (both 0.83), while the best GPT-4 zero-shot model is competitive with prior zero-shot attempts (accuracy 0.70, $F_1$-score 0.75). The authors conduct extensive hyperparameter sweeps for Llama-2 and use statistical analyses, including McNemar tests and regression analyses, to identify which factors most affect performance. The work highlights the importance of re-evaluating model performance after each release, discusses interpretability insights via SHAP/IG, and outlines practical steps for applying LLMs to sarcasm detection in real-world contexts, along with avenues for future research. Overall, the paper provides a methodological framework for evaluating how model characteristics and hyperparameters shape sarcasm detection in large language models on a challenging social-media dataset.
Abstract
Sarcasm detection is challenging for both humans and machines. This work explores how model characteristics impact sarcasm detection in OpenAI's GPT and Meta's Llama-2 models, chosen for their strong natural language understanding and popularity. We evaluate fine-tuned and zero-shot models across various sizes, releases, and hyperparameters. Experiments were conducted on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC2.0) sarcasm dataset. Fine-tuned performance improves monotonically with model size within a model family, and hyperparameter tuning also affects performance. In the fine-tuning scenario, full-precision Llama-2-13b achieves state-of-the-art accuracy and $F_1$-score, both measured at 0.83, comparable to average human performance. In the zero-shot setting, one GPT-4 model achieves performance competitive with prior attempts, yielding an accuracy of 0.70 and an $F_1$-score of 0.75. Furthermore, a model's performance may increase or decline with each release, highlighting the need to reassess performance after each release.
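
As an illustration of the evaluation quantities named above, the following is a minimal sketch of how accuracy and $F_1$-score can be computed for a binary sarcasm classifier, and how two models scored on the same test items can be compared with an exact McNemar test. It is not the authors' code: the function names `accuracy_and_f1` and `mcnemar_exact` are hypothetical, the label convention (1 = sarcastic, 0 = not sarcastic) is an assumption, and the exact binomial form of the test is one common variant that may differ from the one used in the paper.

```python
from math import comb


def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class.

    Assumption: 1 = sarcastic, 0 = not sarcastic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models on the same items.

    Only discordant items matter: those where exactly one model is correct.
    Under the null hypothesis, each discordant item favours either model
    with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    """
    triples = list(zip(y_true, pred_a, pred_b))
    a_only = sum(1 for t, a, b in triples if a == t and b != t)
    b_only = sum(1 for t, a, b in triples if a != t and b == t)

    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Calling `accuracy_and_f1(labels, predictions)` on a model's test-set predictions gives the two headline metrics, and `mcnemar_exact(labels, predictions_a, predictions_b)` indicates whether the difference between two models, for example two releases or two hyperparameter settings, is larger than chance disagreement would explain. On a balanced subset such as pol-bal, accuracy and $F_1$-score are directly comparable, although they can still diverge when a model over-predicts one class.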
