Are Humans as Brittle as Large Language Models?

Jiahui Li, Sean Papay, Roman Klinger

TL;DR

This study investigates whether humans exhibit prompt brittleness comparable to large language models (LLMs). It introduces a systematic prompt perturbation framework that classifies changes as neutral or sensitive and applies them to both LLMs and human annotators across four text classification tasks, quantifying distributional shifts with Jensen-Shannon divergence. The findings show that both humans and LLMs are sensitive to certain prompt changes, especially those altering label sets or formats, with LLMs generally more brittle, though humans display notable effects on specific perturbations like Emo-related wording. Alignment between human and LLM outputs improves when both are prompted identically, and model size influences robustness. The work highlights practical implications for prompt design and annotation protocols, while outlining directions for future research on decoding dynamics and broader model inclusion.
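The distributional shift mentioned above can be illustrated with a minimal sketch of Jensen-Shannon divergence between two label distributions. This is not the authors' code: the function names, the example labels, and the choice of a base-2 logarithm (which bounds the score between 0 and 1) are assumptions made for illustration.

```python
import math
from collections import Counter


def label_distribution(labels, label_set):
    """Turn a list of predicted labels into relative frequencies over label_set."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts[label] / len(labels) for label in label_set]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical responses to a base prompt and to a perturbed prompt.
label_set = ["joy", "anger", "sadness"]
base_labels = ["joy", "joy", "anger", "sadness"]
variant_labels = ["joy", "anger", "anger", "sadness"]

score = js_divergence(
    label_distribution(base_labels, label_set),
    label_distribution(variant_labels, label_set),
)
print(f"JS divergence between base and variant: {score:.4f}")
```

A score of 0 means the two prompts produced identical label distributions; larger values indicate a stronger shift. Note that some libraries report the square root of this quantity (the Jensen-Shannon distance), so scores are only comparable when the same convention is used.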

Abstract

The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects the variance in human annotations. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs with the effects of identical instruction modifications on human annotators. To this end, we prompt both humans and LLMs on a set of text classification tasks under varying prompts. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

Paper Structure

This paper contains 21 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Heatmaps showing distance scores between the distributions of LLM predictions with prompt variants and with the base prompt for the four evaluated tasks in Table \ref{tab:dataset_stats}. Distance scores are calculated using Jensen-Shannon divergence (Dagan et al., 1997). The x-axis represents different types of prompt modifications defined in Section \ref{sec:method}, with numeric indices representing multiple instances of the same type-variant pair. The y-axis lists the five LLMs evaluated in the study. Black dashed lines divide each heatmap into three parts: prompt modifications belonging to the neutral class (left), those from the sensitive class (middle), and the average distance scores for neutral and sensitive class modifications respectively (right). All prompts evaluated are provided in Appendix \ref{app:prompts}.
  • Figure 2: Distance scores between the distributions of responses with prompt variants and with the base prompt. The evaluated tasks are offensiveness rating (a) and emotion classification (b). The x-axis refers to the prompts with their modification types, and the y-axis refers to the five LLMs and the human samples. Distance scores are measured by Jensen-Shannon divergence (Dagan et al., 1997). Evaluated prompts can be found in Table \ref{tab:prompt-off} (a) and Table \ref{tab:prompt-emo} (b).
  • Figure 3: Distributional distance between LLM-generated and human annotations across five LLMs. The evaluated tasks are (a) offensiveness rating and (b) emotion classification. The x-axis and y-axis denote the prompt variants for humans and LLMs respectively. The model name is displayed at the top of each subfigure. Evaluated prompts can be found in Table \ref{tab:prompt-off} for (a) and Table \ref{tab:prompt-emo} for (b).
  • Figure 4: Heatmaps showing Spearman's rank correlation coefficients of distance scores for prompt perturbations across five LLMs. The four evaluated tasks are (a) offensiveness rating, (b) politeness rating, (c) emotion classification, and (d) irony detection. We present the results for prompt perturbations in the neutral and sensitive categories introduced in Section \ref{sec:method}. The x-axis and y-axis labels both refer to the names of the LLMs. The coefficient values range from $-1$ to $1$, where $1$ means a perfect monotonic increasing correlation (ranks agree exactly), $-1$ means a perfect monotonic decreasing correlation (ranks are opposites), and $0$ means no monotonic relationship.
  • Figure 9: A screenshot of the instruction page for the offensiveness rating survey.
  • ...and 2 more figures
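
The Spearman rank correlation described in the Figure 4 caption can be sketched as follows. This is an illustrative implementation, not the authors' analysis code: the function names and the per-perturbation distance scores for the two models are hypothetical. It computes Pearson correlation on average ranks, which handles ties.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical distance scores of two models over the same five perturbations.
model_a_scores = [0.02, 0.05, 0.01, 0.30, 0.12]
model_b_scores = [0.03, 0.04, 0.02, 0.25, 0.20]

rho = spearman(model_a_scores, model_b_scores)
print(f"Spearman correlation between the two models: {rho:.3f}")
```

A coefficient near 1 would suggest that the two models rank the perturbations similarly in terms of how much each one shifts their output distribution, even if the absolute distance scores differ.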