Assessing Logical Reasoning Capabilities of Encoder-Only Transformer Models

Paulo Pirozelli, Marcos M. José, Paulo de Tarso P. Filho, Anarosa A. F. Brandão, Fabio G. Cozman

TL;DR

The paper asks whether encoder-only transformer models possess genuine logical reasoning abilities in propositional calculus and first-order logic. The authors fine-tune several such models for hypothesis classification on four logical reasoning datasets (FOLIO, LogicNLI, RuleTaker, SimpleLogic) and find strong task performance on some datasets but limited generalization across datasets. Cross-probing and layerwise analyses indicate that the task is mostly solved in higher layers and that the models rely largely on dataset-specific cues rather than robust, transferable logic. These findings suggest that the observed logical performance arises from statistical features learned during fine-tuning rather than from true deduction, and they point to the need for architectures or training paradigms that encourage general logical reasoning.

Abstract

Logical reasoning is central to complex human activities, such as thinking, debating, and planning; it is also a central component of many AI systems. In this paper, we investigate the extent to which encoder-only transformer language models (LMs) can reason according to logical rules. We ask whether those LMs can deduce theorems in propositional calculus and first-order logic; whether their relative success in these problems reflects general logical capabilities; and which layers contribute the most to the task. First, we show for several encoder-only LMs that they can be trained, to a reasonable degree, to determine logical validity on various datasets. Next, by cross-probing fine-tuned models on these datasets, we show that LMs have difficulty in transferring their putative logical reasoning ability, which suggests that they may have learned dataset-specific features instead of a general capability. Finally, we conduct a layerwise probing experiment, which shows that the hypothesis classification task is mostly solved through higher layers.
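
The layerwise probing idea mentioned in the abstract can be illustrated with a minimal sketch: freeze the encoder, take the representation produced at each layer, train a small classifier (a probe) on it, and compare test accuracy across layers. Everything below is an assumption made for illustration and is not the authors' code: the features are synthetic stand-ins rather than real encoder outputs, `train_linear_probe`, `probe_accuracy`, and `synthetic_layer` are hypothetical helper names, and the learning rate and epoch count are arbitrary choices unrelated to the paper's settings.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.1, epochs=50):
    """Fit a logistic-regression probe on frozen features with plain SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(log-loss)/dz
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def synthetic_layer(signal, n, rng):
    """Hypothetical stand-in for one layer's frozen representations.

    `signal` controls how linearly separable the two classes are.
    """
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        feats.append([signal * sign + rng.gauss(0, 1), rng.gauss(0, 1)])
        labels.append(y)
    return feats, labels


rng = random.Random(0)
layer_signals = [0.0, 0.5, 1.5, 3.0]  # assumed: later layers carry more signal
layer_accuracies = []
for signal in layer_signals:
    train_x, train_y = synthetic_layer(signal, 200, rng)
    test_x, test_y = synthetic_layer(signal, 200, rng)
    w, b = train_linear_probe(train_x, train_y)
    layer_accuracies.append(probe_accuracy(w, b, test_x, test_y))

for layer, acc in enumerate(layer_accuracies):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

In an actual experiment the synthetic features would be replaced by per-layer hidden states from the fine-tuned model, and a rise in probe accuracy at a given layer would be read as evidence that the layer contributes to solving the task.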

Paper Structure (17 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Examples of logical reasoning arguments. The argument at the top is from FOLIO, a manually-written dataset; the one at the bottom is from RuleTaker, a dataset that uses a semi-synthetic approach.
  • Figure 2: RoBERTa-large models fine-tuned on FOLIO, LogicNLI, RuleTaker, and SimpleLogic, and probed for the same datasets layerwise. The pretrained baselines are indicated by gray lines, while the values achieved in the cross-probing task are represented by black lines. The colored bars indicate the change in accuracy from the previous layers. Probing was performed with a 1-layer classifier and a learning rate of 1e-6.
  • Figure 3: RoBERTa-large models fine-tuned on FOLIO, LogicNLI, RuleTaker, and SimpleLogic, and probed for the same datasets layerwise. The pretrained baselines are indicated by gray lines, while the values achieved in the cross-probing task are represented by black lines. The colored bars indicate the change in accuracy from the previous layers. Probing was performed with a 3-layer classifier and a learning rate of 1e-6.
  • Figure 4: RoBERTa-large models fine-tuned on FOLIO, LogicNLI, RuleTaker, and SimpleLogic, and probed for the same datasets layerwise. The pretrained baselines are indicated by gray lines, while the values achieved in the cross-probing task are represented by black lines. The colored bars indicate the change in accuracy from the previous layers. Probing was performed with a 1-layer classifier and a learning rate of 1e-5.
  • Figure 5: RoBERTa-large models fine-tuned on FOLIO, LogicNLI, RuleTaker, and SimpleLogic, and probed for the same datasets layerwise. The pretrained baselines are indicated by gray lines, while the values achieved in the cross-probing task are represented by black lines. The colored bars indicate the change in accuracy from the previous layers. Probing was performed with a 3-layer classifier and a learning rate of 1e-5.
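
The colored bars described in the captions above can be read as differences between consecutive layerwise probe accuracies. A minimal sketch of that computation follows, assuming "change from the previous layers" means the difference from the immediately preceding layer. The accuracy values are invented for illustration and are not results from the paper, and `layer_deltas` is a hypothetical helper name.

```python
def layer_deltas(accuracies):
    """Return (layer_index, change) pairs relative to the preceding layer.

    The first layer has no predecessor, so the output starts at index 1.
    """
    return [
        (i + 1, curr - prev)
        for i, (prev, curr) in enumerate(zip(accuracies, accuracies[1:]))
    ]


# Hypothetical probe accuracies for four consecutive layers (not paper data).
example_accuracies = [0.50, 0.52, 0.61, 0.80]

deltas = layer_deltas(example_accuracies)
for layer, change in deltas:
    print(f"layer {layer}: {change:+.2f}")

# Layer with the largest gain over its predecessor.
best_layer = max(deltas, key=lambda pair: pair[1])[0]
```

A layer with a large positive change is one where the probe suddenly finds the task easier, which is how a plot of this kind would show the task being solved mostly in higher layers.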