If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
Jasmin Orth, Philipp Mondorf, Barbara Plank
TL;DR
This study investigates how large language models (LLMs) judge the acceptability of conditional statements, asking whether their judgments track the conditional probability $P(B \mid A)$ and the semantic relevance between antecedent and consequent. Using a dataset of 144 conditionals across 12 contexts and four model families under varying prompting strategies, the authors analyze judgments with linear mixed-effects models and ANOVA. Both probabilistic and relevance cues influence LLM judgments, but with substantial variability across architectures and prompts. Relative to humans, LLMs approximate general trends yet integrate probability and relevance more weakly and less systematically, and larger models do not necessarily align more closely with human judgments. The findings highlight the role of prompting and model size in shaping conditional reasoning and underscore the need for richer diagnostics of human–LLM alignment in conditional acceptability. The work clarifies how LLMs integrate probabilistic and evidential cues in a setting where human judgments are known to depend on both coherence and causal relevance, with implications for natural language understanding, explanation, and argumentation in AI systems.
Abstract
Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ to the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance, though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.
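
To make the two factors concrete, the following is a minimal sketch of how conditional probability and relevance can come apart. It assumes relevance is operationalized as the difference $P(B \mid A) - P(B \mid \neg A)$, a measure often used in the psychology literature; the abstract does not state which measure the authors use. The function names and the joint distributions are hypothetical and chosen only for illustration.

```python
def conditional_probability(joint, a_value=True):
    """Return P(B | A = a_value) from a joint table over (A, B).

    `joint` maps (a, b) boolean pairs to probabilities that sum to 1.
    """
    p_a = joint[(a_value, True)] + joint[(a_value, False)]
    if p_a == 0:
        raise ValueError("conditioning event has zero probability")
    return joint[(a_value, True)] / p_a


def delta_p(joint):
    """Assumed relevance measure: P(B | A) - P(B | not A)."""
    return conditional_probability(joint, True) - conditional_probability(joint, False)


# Hypothetical case 1: A raises the probability of B (positive relevance).
relevant = {
    (True, True): 0.40, (True, False): 0.10,
    (False, True): 0.10, (False, False): 0.40,
}

# Hypothetical case 2: B is equally likely with or without A (no relevance).
irrelevant = {
    (True, True): 0.40, (True, False): 0.10,
    (False, True): 0.40, (False, False): 0.10,
}

p_relevant = conditional_probability(relevant)      # P(B | A) in case 1
p_irrelevant = conditional_probability(irrelevant)  # P(B | A) in case 2
dp_relevant = delta_p(relevant)
dp_irrelevant = delta_p(irrelevant)
```

In both hypothetical cases $P(B \mid A)$ is the same, but only the first has a positive difference under the assumed measure. This is the kind of contrast that lets an analysis separate the contribution of probability from that of relevance.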
