Do Large Language Models Understand Logic or Just Mimick Context?

Junbing Yan; Chengyu Wang; Jun Huang; Wei Zhang

TL;DR

The paper investigates whether large language models genuinely understand logical rules or merely exploit contextual cues. It applies counterfactual prompts that manipulate four in-context components: Text, Reasoning Chain, Pattern, and Definitions. It uses LLaMA2 ($7B$–$70B$) and Qwen ($7B$–$200B$) across Folio, Entailment Bank, and MRC to test robustness to replacements and symbol modifications, finding that CoT-style in-context prompts improve performance but do not imply true rule comprehension. Larger models are resilient to some perturbations yet still rely largely on probabilistic associations rather than solid logical understanding, and they adapt only to a limited extent when symbol definitions are swapped. The results highlight the limitations of in-context learning for robust logical reasoning and call for new training paradigms or mechanisms beyond surface-level prompt engineering.

Abstract

Over the past few years, the abilities of large language models (LLMs) have received extensive attention, as these models have performed exceptionally well in complicated scenarios such as logical reasoning and symbolic inference. A significant factor contributing to this progress is the benefit of in-context learning and few-shot prompting. However, the reasons behind the success of such models when reasoning from context have not been fully explored. Do LLMs understand logical rules and use them to draw inferences, or do they "guess" the answers by learning a type of probabilistic mapping through context? This paper investigates the reasoning capabilities of LLMs on two logical reasoning datasets by using counterfactual methods to replace context text and modify logical concepts. Our analysis finds that LLMs do not truly understand logical rules; rather, in-context learning has simply increased the likelihood that these models arrive at the correct answers. If one alters certain words in the context text or changes the concepts of logical terms, the outputs of LLMs can be significantly disrupted, leading to counter-intuitive responses. This work provides critical insights into the limitations of LLMs, underscoring the need for more robust mechanisms to ensure reliable logical reasoning in LLMs.
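The counterfactual idea described in the abstract, changing the concepts of logical terms inside the in-context examples, can be illustrated with a small sketch. This is not the authors' code: the helper names `swap_terms` and `build_prompt`, the prompt layout, the demonstration text, and the wording of the redefinition note are all hypothetical, chosen only to show what a definition-swap prompt might look like next to the original prompt.

```python
import re


def swap_terms(text: str, a: str, b: str) -> str:
    """Swap whole-word occurrences of `a` and `b` (hypothetical helper)."""
    pattern = re.compile(rf"\b({re.escape(a)}|{re.escape(b)})\b", re.IGNORECASE)

    def repl(match: re.Match) -> str:
        # Replace `a` with `b` and `b` with `a`; output uses the given casing.
        return b if match.group(1).lower() == a.lower() else a

    return pattern.sub(repl, text)


def build_prompt(demos, question, definition_note=""):
    """Assemble a few-shot prompt; the layout here is an assumption."""
    parts = [definition_note] if definition_note else []
    for demo in demos:
        parts.append(
            f"Premises: {demo['premises']}\n"
            f"Reasoning: {demo['chain']}\n"
            f"Answer: {demo['answer']}"
        )
    parts.append(f"Premises: {question}\nReasoning:")
    return "\n\n".join(parts)


# Illustrative demonstration and question, not taken from any dataset.
demos = [
    {
        "premises": "It is raining and it is cold.",
        "chain": "Both conjuncts hold, so it is cold.",
        "answer": "True",
    }
]
question = "The light is on or the door is open."

# Original prompt: logical terms keep their usual meaning.
original = build_prompt(demos, question)

# Counterfactual prompt: the two connectives are redefined and swapped in the text.
note = "In this task, 'and' means logical OR and 'or' means logical AND."
swapped_demos = [{k: swap_terms(v, "and", "or") for k, v in d.items()} for d in demos]
counterfactual = build_prompt(swapped_demos, swap_terms(question, "and", "or"), note)

print(original)
print("-----")
print(counterfactual)
```

Sending both prompts to the same model and comparing the answers gives a rough probe of whether the model follows the stated redefinition or falls back on the usual meaning of the connectives.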

Paper Structure (16 sections, 4 figures, 2 tables)


Introduction
Related Work
Large Language Models
Counterfactual Prompt
Logical Reasoning
Method
Experiment
Models
Datasets
Influence of In-Context Examples
Influence of Texts
Influence of Reasoning Chain
Influence of Pattern
Test for Logical Understanding Ability
Enhancing Logical Comprehension Ability for LLM
...and 1 more section

Figures (4)

Figure 1: Tasks and datasets used in our experiment: Text: in blue color; Reasoning Chain: in orange color; Pattern: in purple color.
Figure 2: The impact of different replacement parts on Entailment Bank for Qwen series models' performance.
Figure 3: The impact of different replacement parts on Entailment Bank for LLaMA series models' performance.
Figure 4: Results of different scales of LLaMA and Qwen models over Entailment Bank when using different settings. Each target example has 4 in-context samples as the demonstration.
