Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification

Branislav Pecher, Michal Spiegel, Robert Belanec, Jan Cegin

TL;DR

This work demonstrates that prompt underspecification largely drives the prompt sensitivity observed in LLM-based text classification. By systematically contrasting underspecified prompts with well-specified instruction prompts across multiple tasks and LLaMA-family models, the authors show that instruction prompts and in-context learning significantly stabilize performance and logits, while calibration and UNK-based strategies have inconsistent or detrimental effects. Logit analysis reveals a strong link between label-token logits and accuracy, whereas linear probing indicates that most internal representations remain robust to underspecification, with sensitivity manifesting mainly in the final outputs. The findings advocate rigorous, well-specified prompting and in-context learning as effective, non-invasive mitigations, with broader implications for evaluating and deploying prompt-based classifiers in practice.

Abstract

Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, the investigation of sensitivity is performed using underspecified prompts that provide minimal task instructions and weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction-prompts suffer less from such problems. However, linear probing analysis suggests that the effects of prompt underspecification have only a marginal impact on the internal LLM representations, instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.
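
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' code. The two prompt templates, the label set, and all function names are hypothetical illustrations. The sketch shows an underspecified prompt next to an instruction prompt, measures sensitivity as the spread of accuracy across prompt formats, and computes the probability mass a model places on the label tokens from a vector of next-token logits.

```python
import math
from statistics import mean, pstdev

LABELS = ["negative", "positive"]  # hypothetical label set for illustration


def underspecified_prompt(text: str) -> str:
    # Minimal format: no task description, output space left unconstrained.
    return f"Text: {text}\nSentiment:"


def instruction_prompt(text: str, labels=LABELS) -> str:
    # Explicit task instruction that restricts the answer to the label set.
    options = ", ".join(labels)
    return (
        "Classify the sentiment of the text. "
        f"Answer with exactly one of: {options}.\n"
        f"Text: {text}\nAnswer:"
    )


def accuracy(predictions, gold) -> float:
    # Fraction of predictions that match the gold labels.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def sensitivity(accuracies):
    # Mean and population standard deviation of accuracy across prompt formats;
    # a larger deviation means the model is more sensitive to the format.
    return mean(accuracies), pstdev(accuracies)


def label_token_mass(logits, label_token_ids) -> float:
    # Softmax over the full vocabulary (shifted by the max for stability),
    # then the total probability assigned to the label tokens.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    return sum(exps[i] for i in label_token_ids) / sum(exps)
```

In such a setup, one would collect one accuracy value per prompt format and compare the `sensitivity` of the underspecified formats against that of the instruction formats; how the logits are obtained depends on the model library and is left out here.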

Paper Structure

This paper contains 14 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Using underspecified prompts leads to lower label token probabilities and issues with label extraction.
  • Figure 2: The mean accuracy of the linear probe model over different minimal and instruction prompt formats for all layers. Instruction prompt formats lead to slightly more informative representations.
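
Linear probing, as used for Figure 2, trains a linear classifier on frozen hidden states and reports its accuracy. The sketch below is an illustration under stated assumptions, not the authors' setup: the hidden states are synthetic Gaussian vectors, the probe is a binary logistic regression written with the standard library only, and every name and hyperparameter is hypothetical. In practice the features would be one layer's hidden states for each prompt, and the procedure would be repeated per layer.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    # Binary logistic regression fitted with full-batch gradient descent.
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels) -> float:
    # Predict class 1 when the linear score is positive.
    correct = 0
    for x, y in zip(features, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(labels)


def synthetic_hidden_states(n, dim, shift, seed):
    # Stand-in for real hidden states: two Gaussian clusters, one per class.
    rng = random.Random(seed)
    features, labels = [], []
    for i in range(n):
        y = i % 2
        center = shift if y == 1 else -shift
        features.append([rng.gauss(center, 1.0) for _ in range(dim)])
        labels.append(y)
    return features, labels


if __name__ == "__main__":
    train_x, train_y = synthetic_hidden_states(200, 8, 1.0, seed=0)
    test_x, test_y = synthetic_hidden_states(200, 8, 1.0, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("probe accuracy:", probe_accuracy(w, b, test_x, test_y))
```

Comparing this accuracy between minimal and instruction prompt formats at each layer would correspond to the quantity the figure plots; a library classifier would normally replace the hand-written training loop.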