Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Andrew Katz

Abstract

The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.

Paper Structure (96 sections, 9 equations, 11 figures, 8 tables)


Figures (11)

  • Figure 1: Surprisal curves for "bug" (insect meaning) on the social, ecological, and technological dimensions with moderate context ("The garden bug pollinated flowers while feeding on nectar"). The 14B models correctly identify high ecological scores.
  • Figure 2: Surprisal curves for "bug" (software meaning) on the social, ecological, and technological dimensions with moderate context ("The software bug caused the application to crash unexpectedly"). The 14B models correctly identify high technological scores.
  • Figure 3: Surprisal curves for "virus" (computer meaning) across three context levels: (a) minimal context, where models interpret "virus" as biological and assign high ecological scores; (b) moderate context, where 14B models correctly shift to high technological scores; (c) rich context describing malware behavior, indicating context-dependent disambiguation. The 3B model never adjusts its technological score regardless of context.
  • Figure 4: Binary classification surprisal for clear cases: (a) the causal statement "The heavy rain caused widespread flooding in the city," where upward-sloping lines indicate lower surprisal for "True"; (b) the non-causal statement "The meeting was scheduled for 3 PM," where downward-sloping lines indicate lower surprisal for "False." Each panel shows one model with three context levels.
  • Figure 5: Binary classification surprisal for ambiguous cases: (a) the indirect causal statement "If you heat water to 100 degrees Celsius, it will boil," where lines slope upward but less steeply than clear cases; (b) the correlational statement "Students who study more tend to get better grades," where mixed slopes reflect genuine ambiguity. Compare with Figure 4.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Definition 1: Completion-Based Surprisal
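
The text of Definition 1 is not reproduced in this outline. As a rough illustration of the quantities the abstract describes (surprisal as negative log probability at each scale position, and entropy over the completions), here is a minimal Python sketch. It rests on several assumptions not confirmed by the source: log probabilities for each scale label are already available as natural logs, units are bits, and entropy is computed after renormalizing over the scale positions only. The function names and the example values are hypothetical.

```python
import math


def surprisal_curve(logprobs):
    """Map each scale label to its surprisal in bits.

    `logprobs` holds natural-log probabilities per label (an assumption).
    """
    return {label: -lp / math.log(2) for label, lp in logprobs.items()}


def scale_entropy(logprobs):
    """Shannon entropy in bits over the scale positions.

    Probabilities are renormalized over the listed labels only, which is
    an illustrative choice rather than the paper's stated procedure.
    """
    probs = [math.exp(lp) for lp in logprobs.values()]
    total = sum(probs)
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical log probabilities for the completions "1" through "5".
example_logprobs = {
    "1": math.log(0.05),
    "2": math.log(0.10),
    "3": math.log(0.15),
    "4": math.log(0.50),
    "5": math.log(0.20),
}

curve = surprisal_curve(example_logprobs)
preferred = min(curve, key=curve.get)  # label with the lowest surprisal
entropy_bits = scale_entropy(example_logprobs)
```

In this sketch the minimum of the curve marks the preferred rating, and entropy is highest when the model spreads probability evenly across the scale.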