On Prompt Sensitivity of ChatGPT in Affective Computing

Mostafa M. Amin, Björn W. Schuller

TL;DR

This work presents a framework for systematically evaluating how prompts and generation parameters affect foundation models, applied to ChatGPT on affective computing tasks. It analyzes the temperature $T$ and the top-$p$ parameter alongside a diverse set of prompts, using a Monte Carlo approach across sentiment analysis, toxicity detection, and sarcasm detection, to show when conservative generation and particular prompts yield stable, parseable outputs. Key findings: lower $T$ and smaller top-$p$ generally improve performance and reliability; simple expert-based prompts often perform near the top; and Chain-of-Thought prompts can improve some tasks but make responses harder to parse. The results offer practical guidance for deploying LLMs in downstream affective computing applications and highlight the need to balance performance against parseability and task-specific idiosyncrasies.

Abstract

Recent studies have demonstrated the emerging capabilities of foundation models like ChatGPT in several fields, including affective computing. However, accessing these emerging capabilities is facilitated through prompt engineering. Despite the existence of some prompting techniques, the field is still rapidly evolving and many prompting ideas still require investigation. In this work, we introduce a method to evaluate and investigate how sensitive the performance of foundation models is to different prompts and generation parameters. We perform our evaluation on ChatGPT within the scope of affective computing on three major problems, namely sentiment analysis, toxicity detection, and sarcasm detection. First, we carry out a sensitivity analysis on pivotal parameters in auto-regressive text generation, specifically the temperature parameter $T$ and the top-$p$ parameter in nucleus sampling, which dictate how conservative or creative the model should be during generation. Furthermore, we explore the efficacy of several prompting ideas, examining how giving different incentives or structures affects the performance. Our evaluation takes into consideration performance measures on the affective computing tasks and how effectively the model follows the stated instructions, and hence generates easy-to-parse responses that can be used smoothly in downstream applications.
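
As an illustration of the two generation parameters named in the abstract, the following is a minimal sketch of temperature scaling followed by nucleus (top-$p$) filtering over a vector of logits. It is not ChatGPT's actual decoding code, which is not public; the function name `sample_token`, the treatment of $T = 0$ as greedy decoding, and the toy logits are all assumptions made for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample an index from `logits` using temperature and nucleus (top-p) filtering.

    Hypothetical helper for illustration only; not the API of any real library.
    """
    # Assumption: T == 0 is treated as greedy decoding (argmax),
    # because dividing the logits by zero is undefined.
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling: lower T sharpens the distribution, higher T flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filtering: keep the smallest set of most probable tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the kept probabilities implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

In this sketch, both $T = 0$ and top-$p = 0$ reduce to always picking the most probable token, which is the most conservative setting; raising either value lets less probable tokens through.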

Paper Structure (17 sections, 3 equations, 2 figures, 2 tables)

This paper contains 17 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Sensitivity analysis for the temperature parameter $T$ using the Expert Detailed and CoT prompts. Shown are the classification accuracies with their 95 % confidence intervals on all problems. The values $T \in \{0.0, 0.3, 0.7, 1.0, 1.2, 1.5\}$ are examined.
  • Figure 2: Sensitivity analysis for the top-$p$ parameter using the Expert Detailed and CoT prompts. Shown are the classification accuracies with their 95 % confidence intervals on all problems. The values top-$p \in \{0.0, 0.3, 0.5, 0.7, 1.0\}$ are explored.
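
The figures report classification accuracies with 95 % confidence intervals. How the paper computes those intervals is not stated in this excerpt, so the sketch below is only one common way to obtain such numbers: it parses each response into a label, counts unparseable responses as errors, and uses a normal-approximation (Wald) interval. The helper names `parse_label` and `accuracy_with_ci`, the strict matching rule, and the choice to penalize unparseable responses are assumptions, not the authors' method.

```python
import math


def parse_label(response, labels):
    """Return the label if the response is exactly one allowed label, else None.

    Hypothetical strict parser: anything beyond a bare label is unparseable.
    """
    text = response.strip().lower().rstrip(".")
    return text if text in labels else None


def accuracy_with_ci(predictions, gold, z=1.96):
    """Accuracy with a normal-approximation confidence interval (z=1.96 for 95 %).

    Assumption: a prediction of None (unparseable response) counts as wrong.
    """
    n = len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    acc = correct / n
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    # Clip to [0, 1], since the approximation can overshoot for small n.
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)
```

A strict parser like this makes the trade-off noted in the TL;DR concrete: a verbose response that embeds the answer in reasoning text would be scored as unparseable unless extra extraction logic is added.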