If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models

Jasmin Orth, Philipp Mondorf, Barbara Plank

TL;DR

This study investigates how large language models judge the acceptability of conditional statements, examining whether judgments track probabilistic strength $P(B \mid A)$ and the semantic relevance between antecedent and consequent. Using a dataset of 144 conditionals across 12 contexts and four model families with varying prompting, the authors analyze judgments via linear mixed-effects models and ANOVA, revealing that both probabilistic and relevancy cues influence LLM judgments but with substantial variability across architecture and prompts. Relative to humans, LLMs approximate general trends yet show weaker, less systematic integration of probability and relevance, with larger models not necessarily aligning more closely to human judgments. The findings highlight the nuanced role of prompting and model size in shaping conditional reasoning, and underscore the need for richer diagnostics to better understand human–LLM alignment in conditional acceptability. The work contributes to understanding how probabilistic and evidential cues are integrated by LLMs in a context where human judgments are known to depend on both coherence and causal relevance, with implications for natural language understanding, explanation, and argumentation in AI systems.

Abstract

Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ given the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance, though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.
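
The distinction between the two factors can be made concrete with a small numerical sketch. It assumes that relevance is operationalized as $\Delta P = P(B \mid A) - P(B \mid \neg A)$, a measure common in the psychology literature on conditionals; the abstract does not state which measure the authors use, so this is an illustrative assumption. The function names and all probabilities below are hypothetical and chosen only to show two conditionals with the same $P(B \mid A)$ but different relevance.

```python
def conditional_probability(p_joint: float, p_given: float) -> float:
    """Return P(X | Y) = P(X and Y) / P(Y)."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given


def delta_p(p_a_and_b: float, p_a: float, p_b: float) -> float:
    """Return Delta-P = P(B | A) - P(B | not A).

    Assumed relevance measure (not confirmed by the source):
    positive values mean A raises the probability of B,
    values near zero mean A is irrelevant to B.
    """
    p_b_given_a = conditional_probability(p_a_and_b, p_a)
    # P(not A and B) = P(B) - P(A and B); P(not A) = 1 - P(A)
    p_b_given_not_a = conditional_probability(p_b - p_a_and_b, 1.0 - p_a)
    return p_b_given_a - p_b_given_not_a


# Hypothetical conditional where A supports B.
relevant = {"p_a_and_b": 0.45, "p_a": 0.5, "p_b": 0.5}
# Hypothetical conditional where B is likely regardless of A.
irrelevant = {"p_a_and_b": 0.45, "p_a": 0.5, "p_b": 0.9}

for name, probs in [("relevant", relevant), ("irrelevant", irrelevant)]:
    p_cond = conditional_probability(probs["p_a_and_b"], probs["p_a"])
    print(name, round(p_cond, 3), round(delta_p(**probs), 3))
```

Both hypothetical conditionals have the same conditional probability, yet only the first has a clearly positive $\Delta P$. This mirrors the case in which a highly probable conditional may still be judged less acceptable because the antecedent does not support the consequent.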

Paper Structure

This paper contains 47 sections, 12 figures, 20 tables.

Figures (12)

  • Figure 1: Illustration of two conditionals with equally high conditional probabilities but differing evidential relevance. While both are probable, only the left one encodes a plausible causal or evidential link between antecedent and consequent, appearing more acceptable.
  • Figure 2: Example of three conditional statements representing different relations.
  • Figure 3: Example of a conditional presented in full form (top) and its corresponding split version (bottom), used to elicit different types of model judgments.
  • Figure 4: Distribution of center-scaled judgments across humans and LLMs. Top row: overall distributions. Bottom row: distributions divided by relation types. Left: Llama 70B (vanilla), center: Qwen 72B (vanilla), right: humans.
  • Figure 5: Mean and standard deviation across relevance conditions and vanilla models (red: Llama 8B, green: Llama 70B, blue: Qwen 7B, purple: Qwen 72B).
  • ...and 7 more figures