Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST!

Frank Wildenburg, Michael Hanna, Sandro Pezzelle

TL;DR

This work tackles semantic underspecification in language by introducing the DUST dataset, which covers Type 1–3 underspecifications (Type 4 excluded) with 2,123 underspecified and 2,123 specified sentences drawn from LAVA, CLAIRE, and Winograd sources. It evaluates pre-trained LMs on two fronts: detecting underspecification via a perplexity-based metalinguistic prompt, and interpreting underspecified sentences in naturalistic continuations. Newer models can flag underspecification when prompted but struggle to interpret underspecified content without explicit guidance. The findings point to systematic limitations in current LMs' handling of sentence semantics, with performance highly sensitive to prompt design and evaluation setup, and they underscore the importance of naturalistic data and communicative context in assessing language capabilities. Together, the results motivate more robust evaluation frameworks and dataset expansion to better capture the nuances of underspecification in real-world language use.

Abstract

In everyday language use, speakers frequently utter and interpret sentences that are semantically underspecified, that is, sentences whose content is insufficient to fully convey their message or to be interpreted univocally. For example, to interpret the underspecified sentence "Don't spend too much", which leaves implicit what (not) to spend, additional linguistic context or outside knowledge is needed. In this work, we propose a novel Dataset of semantically Underspecified Sentences grouped by Type (DUST) and use it to study whether pre-trained language models (LMs) correctly identify and interpret underspecified sentences. We find that newer LMs are reasonably able to identify underspecified sentences when explicitly prompted. However, interpreting them correctly is much harder for all LMs. Our experiments show that when interpreting underspecified sentences, LMs exhibit little uncertainty, contrary to what theoretical accounts of underspecification would predict. Overall, our study reveals limitations in current models' processing of sentence semantics and highlights the importance of using naturalistic data and communicative scenarios when evaluating LMs' language capabilities.

Paper Structure (34 sections, 7 figures, 5 tables)

This paper contains 34 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: In experiment 1 (top), we test if LMs distinguish underspecified sentences (S1) from more specified counterparts (S2) when explicitly prompted. In experiment 2 (bottom), we test if LMs correctly interpret S1 and S2 in a more naturalistic communicative setting.
  • Figure 2: Proportion of specification-matched inputs that receive lower perplexity than their specification-mismatched counterparts; higher is better. Asterisks (*) indicate performance significantly ($p < .05$) different from chance (50%). The leftmost column, Overall, reports the proportion computed over the whole dataset.
  • Figure 3: Log difference of perplexity of correct and incorrect continuations for underspecified (blue) and more specified (orange) sentences, averaged over prompts. The underspecified condition is not significantly lower than the more specified condition for any model.
  • Figure 4: Proportion of inputs where the plausible continuation to a specified sentence received lower perplexity; higher is better. Asterisks (*) indicate performance that differs significantly ($p < .05$) from chance (50%).
  • Figure 5: Proportion of specification-matched inputs with lower perplexity than their specification-mismatched counterparts; higher is better. Asterisks (*) indicate performance significantly ($p < .05$) above or below chance (50%).
  • ...and 2 more figures
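
The captions above report perplexity-based measures: the share of paired inputs where one member receives lower perplexity, a log difference of perplexities, and significance against a 50% chance level. The following is a minimal sketch of how such quantities could be computed from per-token log-probabilities. It is not the authors' code. The function names, the sign convention of the log difference, the choice of an exact two-sided binomial test, and the toy numbers are all assumptions made for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def proportion_lower(preferred: Sequence[float], other: Sequence[float]) -> float:
    """Share of pairs in which the preferred input has strictly lower perplexity."""
    if len(preferred) != len(other) or not preferred:
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(p < o for p, o in zip(preferred, other))
    return wins / len(preferred)


def log_ppl_difference(ppl_correct: float, ppl_incorrect: float) -> float:
    """Log difference of two perplexities.

    Assumed sign convention (not stated in the source): the value is
    positive when the correct continuation has the lower perplexity.
    """
    return math.log(ppl_incorrect) - math.log(ppl_correct)


def binomial_p_two_sided(wins: int, n: int) -> float:
    """Exact two-sided binomial test against chance (p = 0.5).

    The source does not name the test it uses; this one is an assumption.
    """
    tail = min(wins, n - wins)
    p = 2 * sum(math.comb(n, k) for k in range(tail + 1)) / 2 ** n
    return min(1.0, p)


if __name__ == "__main__":
    # Made-up log-probabilities for two input pairs, for illustration only.
    matched = [perplexity([-0.4, -0.9, -0.3]), perplexity([-1.1, -0.2])]
    mismatched = [perplexity([-0.8, -1.5, -0.6]), perplexity([-0.9, -0.1])]

    print("proportion lower:", proportion_lower(matched, mismatched))
    print("log difference, pair 0:", log_ppl_difference(matched[0], mismatched[0]))
    print("p-value for 9 wins out of 10:", binomial_p_two_sided(9, 10))
```

In practice the log-probabilities would come from a language model scoring each prompt and sentence (or sentence and continuation) pair; that scoring step is left out here because the source excerpt does not specify it.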