Logical forms complement probability in understanding language model (and human) performance

Yixuan Wang, Freda Shi

TL;DR

This work investigates LLM logical reasoning beyond probability by constructing a controlled dataset of propositional and alethic modal logic syllogisms expressed in natural language. It systematically evaluates multiple open-weight LLMs and human participants using a probability-based soft accuracy metric, revealing that logical form and modality significantly influence performance alongside perplexity, with Diamond generally easier than Box. Through linear and generalized mixed-effects modeling, the study shows robust effects of Modality and ArgForm and a relatively weak but negative correlation with perplexity, while also documenting an affirmation bias in LLMs that varies by modality. Human data echo some patterns but diverge in others, underscoring both similarities and differences between machine and human reasoning. The findings advocate for incorporating logical-form structure into evaluation and planning frameworks and provide a publicly releasable dataset for further study of machine and human logical reasoning in natural language.

Abstract

With the increasing interest in using large language models (LLMs) for planning in natural language, understanding their behaviors becomes an important research question. This work conducts a systematic investigation of LLMs' ability to perform logical reasoning in natural language. We introduce a controlled dataset of hypothetical and disjunctive syllogisms in propositional and modal logic and use it as the testbed for understanding LLM performance. Our results lead to novel insights in predicting LLM behaviors: in addition to the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical forms should be considered as important factors. In addition, we show similarities and discrepancies between the logical reasoning performances of humans and LLMs by collecting and comparing behavioral data from both.

Paper Structure

This paper contains 27 sections, 14 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Perplexity is not a reliable indicator of logical reasoning performance, and therefore neither is probability. The probabilities that Llama-3-70B assigns to the ground-truth answer (i.e., soft accuracy; Y-axis) are plotted against the perplexity of the corresponding example question (X-axis), grouped by (a) modality, (b) argument form, and (c) logic interpretation content. Each group consists of 20 randomly selected examples with other factors controlled.
  • Figure 2: The data synthesis pipeline: we assign a meaning to each variable in the logical forms (\ref{subsec:syn-logic}) to obtain natural-language question-answer pairs (\ref{subsec:syn-natural-language}).
  • Figure 3: Estimated marginal means of logical form factors in the mixed-effects model of Eq. \ref{eqn: mixed-effects}, along with their 95% confidence intervals.
  • Figure 4: Illustration of per-model random effects on soft accuracy in the mixed-effects model of Eq. \ref{eqn: mixed-effects} with 99.9% confidence intervals. (a) Mixed effects (i.e., the sum of fixed and random effects) of perplexity. (b) Intercept random effects (i.e., constant term per model on soft accuracy), with the model performance rank (Table \ref{tab:softacc-base}) annotated in parentheses.
  • Figure 5: Correlation between mean perplexity and mean confidence score on each logic sequent. Each point represents an average over a group of 1000 prompts that share the same underlying logic sequent. Two connected dots share the same logic formula.
  • ...and 2 more figures
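
The two quantities plotted against each other in Figure 1, perplexity and soft accuracy, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the input format (natural-log probabilities supplied by some language model), and the renormalization of soft accuracy over a fixed set of candidate answers are all assumptions made for illustration. The source states only that soft accuracy is the probability assigned to the ground-truth answer.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a prompt: exp of the negative mean natural-log probability per token.

    `token_logprobs` is a hypothetical input: one log-probability per prompt token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def soft_accuracy(answer_logprobs, gold):
    """Probability mass placed on the gold answer.

    Assumption: probabilities are renormalized over the candidate answers
    given in `answer_logprobs` (a dict mapping answer string -> log-probability).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(answer_logprobs.values())
    weights = {a: math.exp(lp - m) for a, lp in answer_logprobs.items()}
    return weights[gold] / sum(weights.values())


if __name__ == "__main__":
    # Toy numbers, not results from the paper.
    print(perplexity([math.log(0.5)] * 4))
    print(soft_accuracy({"Yes": math.log(0.6), "No": math.log(0.2)}, "Yes"))
```

Under this sketch, a prompt can have low perplexity and still receive low soft accuracy, which is the kind of dissociation the Figure 1 caption describes.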