Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics

Fangru Lin, Daniel Altshuler, Janet B. Pierrehumbert

TL;DR

This paper systematically probes large language models for scalar adjective lexical semantics and scalar implicature-based pragmatic reasoning (scalar diversity). Using direct and indirect probing across datasets of scalar adjectives, the authors show robust lexical-semantic knowledge but limited ability to model scalar diversity, with larger models not consistently yielding better performance. The study compares encoder, encoder-decoder, and decoder models (e.g., BERT, RoBERTa, Flan-T5, Falcon, GPT-4) and reports that some smaller or differently trained models outperform larger ones in pragmatic tasks, highlighting the role of training objectives and prompting strategies. The findings suggest that strengthening pragmatic inferences such as scalar implicatures in LLMs requires targeted training signals beyond lexical-semantic proficiency, which can guide future work on aligning semantic and pragmatic capabilities in NLP applications.

Abstract

Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g. certain is more intense than likely on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of Large Language Models such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, the rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.

Paper Structure (43 sections, 3 equations, 4 figures, 15 tables)

This paper contains 43 sections, 3 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: People often mean more than what they literally say. Humans can easily infer implied messages, while LLMs often fail to do so.
  • Figure 2: The process to derive the intensity vector $\vec{d_{vec}}$. First, the adjectives in a half-scale are randomly shuffled ten times, and each ordering is used as an input to a language model. The encoded vectors for the same word across the different inputs are then combined with the Hadamard (element-wise) mean to derive the final representation of the word. Next, the layer-wise representation of the weakest adjective is subtracted from that of the strongest adjective ($\vec{awesome}-\vec{good}$ in this case), and the result is averaged over all relevant half-scale subtractions in a dataset to give $\vec{d_{vec}}$. Finally, the layer-wise $\vec{d_{vec}}$ is used to probe language models' knowledge of adjective intensities.
  • Figure 3: Attention visualization by BertViz (Vig, 2019). Attention head 10 in the last layer of RoBERTa-b picks up good, great, wonderful, and awesome when computing good in the context of 'A is good. B is awesome. C is wonderful. D is great.'
  • Figure 4: Free generation results for GPT-4 using a prompt from GZ without forcing yes or no answers.
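
The derivation described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_encode` is a hypothetical stand-in for a language model (it returns position-dependent toy vectors so that shuffling has an effect), the function names and the vector size `DIM` are assumptions, and only a single layer is modelled, whereas the caption describes a layer-wise computation. Each half-scale is assumed to be a list of distinct adjectives ordered from weakest to strongest.

```python
import random

DIM = 4  # assumed toy vector size; a real model has a much larger hidden size


def toy_encode(words):
    """Hypothetical stand-in for a language model (NOT a real encoder).

    Returns one vector per word; the small position term mimics the fact
    that contextual representations depend on word order in the input.
    """
    vectors = {}
    for pos, word in enumerate(words):
        code = sum(ord(c) for c in word)
        base = [((code * (k + 1)) % 7) / 7.0 for k in range(DIM)]
        vectors[word] = [b + 0.01 * pos for b in base]
    return vectors


def elementwise_mean(vectors):
    """Element-wise (Hadamard) mean of equally sized vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def word_representations(half_scale, encode, n_shuffles=10, seed=0):
    """Average each word's vectors over several random orderings."""
    rng = random.Random(seed)
    collected = {word: [] for word in half_scale}
    for _ in range(n_shuffles):
        order = list(half_scale)
        rng.shuffle(order)
        for word, vector in encode(order).items():
            collected[word].append(vector)
    return {word: elementwise_mean(vs) for word, vs in collected.items()}


def intensity_vector(half_scales, encode):
    """Mean of (strongest - weakest) differences over all half-scales."""
    differences = []
    for scale in half_scales:
        reps = word_representations(scale, encode)
        weakest, strongest = reps[scale[0]], reps[scale[-1]]
        differences.append([s - w for s, w in zip(strongest, weakest)])
    return elementwise_mean(differences)


# Illustrative half-scales, ordered weakest -> strongest.
scales = [["good", "great", "awesome"], ["warm", "hot", "scorching"]]
d_vec = intensity_vector(scales, toy_encode)
```

With a real model, `toy_encode` would be replaced by a function that returns the chosen layer's hidden state for each adjective, and the computation would be repeated per layer to obtain the layer-wise vectors mentioned in the caption.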