Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

Wenyue Hua, Kaijie Zhu, Lingyao Li, Lizhou Fan, Shuhang Lin, Mingyu Jin, Haochen Xue, Zelong Li, JinDong Wang, Yongfeng Zhang

TL;DR

ContextHub addresses whether large language models truly reason or rely on contextual cues by pairing abstract and contextualized instantiations of the same propositional logic templates. The authors construct a scalable benchmark with 4 difficulty levels, 12 domains plus an abstract domain, and rigorous quality control to study context effects and generalization. Key findings show that model size interacts with context, with large models excelling on abstract logic while contextualized data can substantially boost fine-tuning generalization, though highly complex tasks challenge contextualized approaches. The work provides a flexible, domain-aware framework for evaluating and improving reasoning in LLMs and highlights instantiated data as a powerful resource for generalization in practice.

Abstract

This study aims to systematically disentangle pure logical reasoning from text understanding by contrasting abstract and contextualized logical problems drawn from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions: (1) Can abstract logical problems alone accurately benchmark an LLM's reasoning ability in real-world scenarios, disentangled from the contextual support available in practical settings? (2) Does fine-tuning LLMs on abstract logic problems generalize to contextualized logic problems, and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive reasoning. In particular, we construct instantiated datasets for deductive and abductive reasoning with 4 levels of difficulty, encompassing 12 distinct categories or domains based on the categorization of Wikipedia. Our experiments aim to provide insights into disentangling context in logical reasoning, the true reasoning capabilities of LLMs, and their generalization potential. The code and dataset are available at: https://github.com/agiresearch/ContextHub.
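
To make the idea of "the same logical structure, different context" concrete, the following is a minimal sketch and not the ContextHub pipeline. It assumes a hypothetical tuple encoding of propositional formulas, checks deductive validity by brute-force truth table, and renders one template twice: once with abstract symbols and once with an invented medical wording. The names `evaluate`, `variables`, `entails`, and `render`, and the example sentences, are illustrative assumptions that do not come from the paper or its repository.

```python
from itertools import product

# Hypothetical encoding (an assumption for this sketch):
# ("var", name), ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)
BINARY_OPS = {"and", "or", "implies"}


def evaluate(formula, assignment):
    """Return the truth value of a formula under a {name: bool} assignment."""
    op = formula[0]
    if op == "var":
        return assignment[formula[1]]
    if op == "not":
        return not evaluate(formula[1], assignment)
    if op not in BINARY_OPS:
        raise ValueError(f"unknown operator: {op}")
    left = evaluate(formula[1], assignment)
    right = evaluate(formula[2], assignment)
    if op == "and":
        return left and right
    if op == "or":
        return left or right
    return (not left) or right  # material implication


def variables(formula):
    """Collect the variable names that occur in a formula."""
    if formula[0] == "var":
        return {formula[1]}
    return set().union(*(variables(sub) for sub in formula[1:]))


def entails(premises, conclusion):
    """Deductive check: the conclusion holds in every model of the premises."""
    names = sorted(variables(conclusion).union(*(variables(p) for p in premises)))
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(evaluate(p, assignment) for p in premises) and not evaluate(conclusion, assignment):
            return False  # found a counterexample
    return True


def render(formula, wording):
    """Turn a formula into text using a per-variable wording (abstract or contextual)."""
    op = formula[0]
    if op == "var":
        return wording[formula[1]]
    if op == "not":
        return f"it is not the case that {render(formula[1], wording)}"
    left, right = render(formula[1], wording), render(formula[2], wording)
    if op == "implies":
        return f"if {left}, then {right}"
    return f"{left} {op} {right}"


p, q = ("var", "p"), ("var", "q")
rule = ("implies", p, q)

# Two instantiations of the same template; the contextual wording is invented.
abstract_wording = {"p": "p", "q": "q"}
medical_wording = {"p": "the patient has a fever", "q": "the clinic orders a blood test"}

abstract_text = render(rule, abstract_wording)
contextual_text = render(rule, medical_wording)
valid = entails([rule, p], q)      # modus ponens
invalid = entails([rule, q], p)    # affirming the consequent
```

Because the ground-truth label depends only on the template, both renderings share the same answer, so any gap in model accuracy between them can be attributed to context rather than to logic. Brute-force enumeration is chosen here only for clarity; it grows exponentially with the number of variables.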

Paper Structure (30 sections, 2 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Benchmark Construction Procedure
  • Figure 2: Main Benchmark Performance
  • Figure 3: Abstract performance vs. instantiated performance
  • Figure 4: Results of weighted F1-score and Chi-square test
  • Figure 5: Results of weighted F1-score and Chi-square test (Cont.)
  • ...and 4 more figures