Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification
Branislav Pecher, Michal Spiegel, Robert Belanec, Jan Cegin
TL;DR
This work demonstrates that prompt underspecification largely drives observed prompt sensitivity in LLM-based text classification. By systematically contrasting underspecified prompts with well-specified instruction prompts across multiple tasks and LLaMA-family models, the authors show that instruction prompts and in-context learning significantly stabilize performance and logits, while calibration and UNK-based strategies have inconsistent or detrimental effects. Logit analysis reveals a strong link between label-token logits and accuracy, whereas linear probing indicates that most internal representations remain robust to underspecification, with sensitivity manifesting mainly in final outputs. The findings advocate for rigorous, well-specified prompting and in-context learning as effective, non-invasive mitigations, with broader implications for evaluating and deploying prompt-based classifiers in practice.
Abstract
Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, sensitivity is investigated using underspecified prompts that provide minimal task instructions and only weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction prompts suffer less from these problems. However, linear probing analysis suggests that prompt underspecification has only a marginal impact on the internal LLM representations, with its effects instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.
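
As a rough illustration of what "performance variance" across prompt variations can mean in practice, the sketch below computes accuracy per prompt variant and then the spread of those accuracies. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the choice of population standard deviation as the spread measure, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def prompt_sensitivity(predictions_by_prompt, labels):
    """Return (mean accuracy, std of accuracy) across prompt variants.

    Assumption: sensitivity is summarised as the population standard
    deviation of accuracy over prompt variants. Other spread measures
    (range, variance) would work the same way.
    """
    accs = [accuracy(preds, labels) for preds in predictions_by_prompt.values()]
    return mean(accs), pstdev(accs)


# Toy, made-up predictions for illustration only (not results from the paper).
labels = ["pos", "neg", "pos", "neg"]

underspecified = {
    "variant_a": ["pos", "neg", "pos", "neg"],
    "variant_b": ["pos", "pos", "pos", "pos"],
}
instruction = {
    "variant_a": ["pos", "neg", "pos", "neg"],
    "variant_b": ["pos", "neg", "pos", "pos"],
}

under_mean, under_std = prompt_sensitivity(underspecified, labels)
instr_mean, instr_std = prompt_sensitivity(instruction, labels)
print(f"underspecified: mean={under_mean:.3f} std={under_std:.3f}")
print(f"instruction:    mean={instr_mean:.3f} std={instr_std:.3f}")
```

In this toy setup the underspecified prompt format shows the larger spread across variants, which is the kind of comparison the performance analysis refers to. Real experiments would replace the hand-written predictions with model outputs over many prompt variants and datasets.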
