Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics
Fangru Lin, Daniel Altshuler, Janet B. Pierrehumbert
TL;DR
This paper systematically probes large language models for the lexical semantics of scalar adjectives and for one aspect of their pragmatics, scalar diversity: the variation in how readily different scalar adjectives trigger scalar implicatures. Using direct and indirect probing across datasets of scalar adjectives, the authors show that the models have robust lexical-semantic knowledge but limited ability to model scalar diversity, and that larger models do not consistently perform better. The study compares encoder and decoder architectures (e.g., BERT, RoBERTa, Flan-T5, Falcon, GPT-4) and reports that some smaller or differently trained models outperform larger ones on pragmatic tasks, highlighting the role of training objectives and prompting strategies. The findings suggest that improving pragmatic inferences such as scalar implicatures in LLMs requires targeted training signals beyond lexical-semantic proficiency, a point relevant to future work on aligning semantic and pragmatic capabilities in NLP applications.
Abstract
Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g., "certain" is more intense than "likely" on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of Large Language Models such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, this rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.
