Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

Abulhair Saparov; Richard Yuanzhe Pang; Vishakh Padmakumar; Nitish Joshi; Seyed Mehran Kazemi; Najoung Kim; He He

TL;DR

This study asks whether large language models can generalize deductive reasoning beyond their in-context demonstrations. It introduces PrOntoQA-OOD, a programmable dataset that controls deduction rules, proof depth, proof width, and compositional structure. Across four LLMs with 8-shot chain-of-thought prompting, the models generalize to compositional proofs but struggle with longer proofs, and they need explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction. The results also indicate that diverse, simple, and sometimes rule-specific demonstrations can improve OOD generalization and that distractors can aid robust reasoning, which highlights differences between in-context learning and supervised training. Overall, the paper provides a framework for evaluating OOD deductive reasoning in LLMs and points to directions for improving ICL mechanisms and data design.

Abstract

Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
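To make the idea of a programmable dataset with controllable proof depth concrete, here is a minimal sketch in Python. It is not the authors' generator: the function name `make_modus_ponens_chain`, the placeholder concept names, and the sentence templates are all assumptions chosen for illustration, and the sketch covers only repeated modus ponens rather than the full set of deduction rules.

```python
import random

# Placeholder concept names (an assumption for this sketch).
CONCEPTS = ["wumpus", "yumpus", "zumpus", "dumpus",
            "rompus", "numpus", "tumpus", "vumpus"]


def make_modus_ponens_chain(depth, entity="Alex", seed=0):
    """Build a toy example whose proof applies modus ponens `depth` times."""
    if not 1 <= depth < len(CONCEPTS):
        raise ValueError("depth must be between 1 and len(CONCEPTS) - 1")
    rng = random.Random(seed)
    # Pick depth + 1 distinct concepts: c0 -> c1 -> ... -> c_depth.
    chain = rng.sample(CONCEPTS, depth + 1)

    # Premises: one fact about the entity plus one rule per hop.
    premises = [f"{entity} is a {chain[0]}."]
    premises += [f"Every {a} is a {b}." for a, b in zip(chain, chain[1:])]

    # Proof: start from the fact, then alternate rule and derived conclusion.
    proof = [f"{entity} is a {chain[0]}."]
    for a, b in zip(chain, chain[1:]):
        proof.append(f"Every {a} is a {b}.")
        proof.append(f"{entity} is a {b}.")

    query = f"{entity} is a {chain[-1]}."
    return {"premises": premises, "query": query, "proof": proof}


if __name__ == "__main__":
    example = make_modus_ponens_chain(depth=3)
    print("\n".join(example["proof"]))
```

Under this sketch, a depth-generalization test would place shallow chains (for instance, `depth=1` or `depth=2`) in the prompt and evaluate on a deeper chain, so the test proof is longer than any demonstration.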

Paper Structure (34 sections, 1 equation, 17 figures, 2 tables, 4 algorithms)

Introduction
Related work
OOD generalization of LLMs.
Evaluating reasoning abilities of LLMs.
Understanding in-context learning.
Approach
A programmable dataset.
Generating proofs with a complete set of deduction rules.
Varying proof width and depth.
Generating compositional proofs.
Adding distractors.
Formal evaluation of chain-of-thought.
Results
Can LLMs use deduction rules other than modus ponens?
Out-of-demonstration generalization
...and 19 more sections

Figures (17)

Figure 1: An overview of the kinds of OOD generalization that we test in our experiments. Each training example is a sample CoT demonstration provided to the LLM in the few-shot prompt, whereas each test example is a sample proof that the model is expected to output.
Figure 2: An example of a compositional proof containing modus ponens, proof by contradiction, and conjunction introduction, shown in both natural language and a formal tree representation.
Figure 3: An overview and properties of the LLMs in our experiments. GPT-3.5 is marked with an asterisk (*) because we were not able to verify its size.
Figure 4: (top) Proof accuracy across examples with different deduction rules. The in-context examples and test examples come from the same distribution. (bottom) Change in proof accuracy, where the test example is out-of-demonstration with respect to the in-context examples. That is, the test example has the specified deduction rule, but the in-context examples are uniformly distributed over all other deduction rules. See Figure \ref{fig:rule_accuracies_absolute} in the Appendix for the equivalent plot with absolute proof accuracy on the y-axis. See Figure \ref{fig:ood_or_elim_error_example} for an incorrect example. Implication elimination examples have proof width of $1$ and depth of $2$. Conjunction introduction, conjunction elimination, and disjunction introduction examples have proof width $3$ and depth $2$. Disjunction elimination examples have proof width $3$ and depth $1$. Proof by contradiction examples have proof width $2$ and depth $1$.
Figure 5: Example of an incorrect proof generated by GPT-3.5 on an out-of-demonstration disjunction elimination example. The premises (axioms) are given in blue, and invalid steps are given in red. For the full example, see Figure \ref{fig:ood_or_elim_error_example_full} in the Appendix.
...and 12 more figures
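
The out-of-demonstration setup described in the Figure 4 caption can be sketched as follows: the test example uses one deduction rule, and the in-context examples are drawn uniformly from all the other rules. This is an illustrative sketch only. The rule list is taken from the caption, the default of 8 demonstrations follows the 8-shot setting mentioned in the TL;DR, and the function name `sample_demo_rules` is an assumption, not the authors' code.

```python
import random

# Deduction rules named in the Figure 4 caption.
RULES = [
    "implication elimination",
    "conjunction introduction",
    "conjunction elimination",
    "disjunction introduction",
    "disjunction elimination",
    "proof by contradiction",
]


def sample_demo_rules(test_rule, num_shots=8, seed=0):
    """Choose a rule for each in-context demo, never using the test rule."""
    if test_rule not in RULES:
        raise ValueError(f"unknown rule: {test_rule}")
    rng = random.Random(seed)
    # Hold out the test rule, then sample uniformly (with replacement)
    # from the remaining rules, one draw per demonstration.
    others = [rule for rule in RULES if rule != test_rule]
    return [rng.choice(others) for _ in range(num_shots)]


if __name__ == "__main__":
    for rule in sample_demo_rules("disjunction elimination"):
        print(rule)
```

Each sampled rule name would then be handed to a proof generator to produce the corresponding chain-of-thought demonstration.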
