Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics

Fangru Lin; Daniel Altshuler; Janet B. Pierrehumbert

TL;DR

This paper systematically probes large language models for scalar adjective lexical semantics and scalar implicature-based pragmatic reasoning (scalar diversity). Using direct and indirect probing across datasets of scalar adjectives, the authors show robust lexical-semantic knowledge but limited ability to model scalar diversity, with larger models not consistently yielding better performance. The study compares encoder and decoder architectures (e.g., BERT, RoBERTa, Flan-T5, Falcon, GPT-4) and reports that some smaller or differently trained models outperform larger ones in pragmatic tasks, highlighting the role of training objectives and prompting strategies. The findings suggest that strengthening pragmatic inferences like scalar implicatures in LLMs requires targeted training signals beyond lexical-semantic proficiency, guiding future work in aligning semantic and pragmatic capabilities for NLP applications.

Abstract

Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g. certain is more intense than likely on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of Large Language Models such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, the rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.

Paper Structure (43 sections, 3 equations, 4 figures, 15 tables)

Introduction
Related works
Scalar Adjective Lexical Semantics
Scalar Implicature and Scalar Diversity Pragmatics
Direct and Indirect Probing
Probing Lexical Semantics
Datasets for Lexical-Semantic Probing
Scalar Adjective Datasets
Context Sentence Datasets
Probing Scale Membership
Scale Membership Direct Probing Method
Scale Membership Direct Probing Experiment and Results
Scale Membership Indirect Probing Method
Scale Membership Indirect Probing Experiment and Results
Probing Scalar Intensity
...and 28 more sections

Figures (4)

Figure 1: People often mean more than what they literally say. Humans can easily infer implied messages, while LLMs often fail to do so.
Figure 2: The process for deriving the intensity vector $\vec{d_{vec}}$. First, the adjectives of a half-scale are randomly shuffled ten times to produce differently ordered inputs to a language model. The encoded vectors for the same word across these inputs are then averaged element-wise (the Hadamard mean) to give the final representation of the word. Next, the layer-wise representation of the weakest adjective is subtracted from that of the strongest adjective ($\vec{awesome}-\vec{good}$ in this case), and the differences are averaged over all relevant half-scales in a dataset to obtain $\vec{d_{vec}}$. Finally, the layer-wise $\vec{d_{vec}}$ is used to probe language models' knowledge of adjective intensities.
Figure 3: Attention visualization by BertViz (Vig, 2019). Attention head 10 in the last layer of RoBERTa-b picks up good, great, wonderful, and awesome when computing good in the context of 'A is good. B is awesome. C is wonderful. D is great.'
Figure 4: Free generation results for GPT-4 using a prompt from GZ without forcing yes or no answers.
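
The derivation described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `word_representation` and `intensity_vector` are hypothetical, the inputs are assumed to be precomputed layer-wise word vectors of shape (n_layers, dim), and the steps of building shuffled inputs and running the language model are left out.

```python
import numpy as np

def word_representation(encodings):
    """Element-wise mean of one word's vectors across shuffled inputs.

    encodings: array-like of shape (n_shuffles, n_layers, dim), assumed to
    hold the same word's layer-wise vectors from each shuffled input.
    Returns an array of shape (n_layers, dim).
    """
    return np.mean(np.asarray(encodings, dtype=float), axis=0)

def intensity_vector(half_scales):
    """Average of (strongest - weakest) over half-scales, per layer.

    half_scales: list of (weakest_rep, strongest_rep) pairs, each an array
    of shape (n_layers, dim) as returned by word_representation.
    Returns an array of shape (n_layers, dim).
    """
    diffs = [np.asarray(strong, dtype=float) - np.asarray(weak, dtype=float)
             for weak, strong in half_scales]
    return np.mean(diffs, axis=0)

if __name__ == "__main__":
    # Toy stand-in for encoder output: 10 shuffles, 2 layers, 4 dimensions.
    rng = np.random.default_rng(0)
    good = word_representation(rng.normal(size=(10, 2, 4)))
    awesome = word_representation(rng.normal(size=(10, 2, 4)))
    d_vec = intensity_vector([(good, awesome)])
    print(d_vec.shape)
```

Keeping the layer axis in every step means the result stays layer-wise, as the caption describes, so each layer can be probed separately.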
