Exploring Self-supervised Logic-enhanced Training for Large Language Models

Fangkai Jiao, Zhiyang Teng, Bosheng Ding, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty

TL;DR

LogicLLM introduces a self-supervised, logic-enhanced meta-training regime for LLMs by turning MERIt into an autoregressive objective and constructing logically consistent data from Wikipedia, paired with counterfactual augmentation to strengthen relational reasoning. The framework is model-agnostic and evaluated on FLAN-T5 and LLaMA across multiple benchmarks (ReClor, LogiQA-v2, RACE, MMLU, BBH), showing significant gains in logical reasoning without sacrificing broad language understanding. Larger models benefit more, and the approach remains compatible with instruction tuning, suggesting a scalable path to robust logic in LLMs. Comprehensive analyses of data construction, training strategies, robustness, and compatibility provide practical guidance for deploying logic-aware LLMs.

Abstract

Existing efforts to improve the logical reasoning ability of language models have predominantly relied on supervised fine-tuning, which hinders generalization to new domains and tasks. The development of Large Language Models (LLMs) has demonstrated the capacity to compress abundant knowledge into a single proxy, enabling them to tackle multiple tasks effectively. Our preliminary experiments nevertheless show that LLMs fall short on logical reasoning: their performance on logical reasoning benchmarks is far behind the existing state-of-the-art baselines. In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training and activating it via in-context learning, an approach we term LogicLLM. Specifically, we devise an auto-regressive variant of the MERIt objective and integrate it with two LLM series, FLAN-T5 and LLaMA, with parameter sizes ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM. In addition, we conduct extensive ablation studies to analyze the key factors in designing logic-oriented proxy tasks.

Paper Structure

This paper contains 37 sections, 5 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: An example logical reasoning task from the LogiQA-v2 dataset. The relations between different constituents, e.g., agriculture and the development of Andean society, involve various predicates and are hard to convert into logical form through either first-order logic or a formal language.
  • Figure 2: The LogicLLM framework. $P$ and $Q$ are two arbitrary paragraphs from Wikipedia. In Step 1, we extract intra-sentence relations ①: $\langle\,e_{i},s_{k},e_{j}\,\rangle$, and their compositions ②: $\langle e_i, s_{i+1}, e_{i+1}, \cdots,s_{j},e_j\rangle$, from $P$ for an entity pair $\langle\,e_{i},\,e_{j}\,\rangle$; ① and ② are direct and indirect relations, respectively. Here $s_k$ is a relation, represented by the sentence that mentions $\langle\,e_{i},\,e_{j}\,\rangle$. ① and ② are viewed as logically consistent since both describe the "same" relation between $\langle\,e_{i},\,e_{j}\,\rangle$ from different views. In Part I of the figure, $e_i$ refers to Everdigen and $e_j$ to Sweden, and the intermediate entity is Norwegian. The direct relation on the left says that Everdigen has traveled to Sweden, and the indirect relation implies that Everdigen has probably visited Sweden and its nearby area, since otherwise he could not have completed the sketches of Norwegian; this demonstrates fuzzy logical consistency that holds with high probability. Step 2 is counterfactual data augmentation, where a counterfactual relation composition is generated by random entity replacement. ③ and ④ are the counterfactual augmentations of ① and ②, respectively. Finally, in Step 3, the LLM is optimized to generate direct/indirect relations given their logically consistent indirect/direct counterparts as inputs. Here, ①$\rightarrow$②, ②$\rightarrow$①, ③$\rightarrow$④, and ④$\rightarrow$③ are considered.
  • Figure 3: Results of 5 experiments with different option input orders across different model sizes on the test set of LogiQA-v2. Brown circular marker: outlier, green triangle: arithmetic mean value.
  • Figure 4: The average log-likelihood of different models on the self-constructed logically consistent and inconsistent instances, respectively. "w/ L." refers to models augmented with LogicLLM.
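
The three steps in the Figure 2 caption can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline: the toy sentences and entity annotations are made up, the indirect relation is limited to a single intermediate entity, the counterfactual step replaces only $e_i$ and $e_j$, and every function name (`direct_relations`, `indirect_relations`, `counterfactual`, `build_pairs`) is hypothetical.

```python
import random

# Toy paragraph P: (sentence, entities it mentions). Made-up data; the
# entity annotations are assumed to be available beforehand.
SENTENCES = [
    ("Everdigen traveled to Sweden.", {"Everdigen", "Sweden"}),
    ("Everdigen made sketches of Norwegian landscapes.", {"Everdigen", "Norwegian"}),
    ("The Norwegian coast lies close to Sweden.", {"Norwegian", "Sweden"}),
]


def direct_relations(sentences, e_i, e_j):
    """Step 1, relation (1): sentences that mention both entities."""
    return [text for text, ents in sentences if {e_i, e_j} <= ents]


def indirect_relations(sentences, e_i, e_j):
    """Step 1, relation (2): two-sentence chains e_i -> e_k -> e_j.

    Simplification: only one intermediate entity e_k is considered.
    """
    entities = set().union(*(ents for _, ents in sentences))
    chains = []
    for e_k in sorted(entities - {e_i, e_j}):
        first = [t for t, ents in sentences if {e_i, e_k} <= ents and e_j not in ents]
        second = [t for t, ents in sentences if {e_k, e_j} <= ents and e_i not in ents]
        chains.extend(f"{a} {b}" for a in first for b in second)
    return chains


def counterfactual(text, mapping):
    """Step 2: swap entity mentions for randomly chosen replacements."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def build_pairs(sentences, e_i, e_j, pool, rng):
    """Step 3 inputs: (source, target) text pairs in both directions."""
    # Assumption: the same replacement is applied to both sides of a pair.
    mapping = dict(zip([e_i, e_j], rng.sample(pool, 2)))
    pairs = []
    for d in direct_relations(sentences, e_i, e_j):
        for ind in indirect_relations(sentences, e_i, e_j):
            pairs += [(d, ind), (ind, d)]            # (1)->(2), (2)->(1)
            c_d = counterfactual(d, mapping)
            c_ind = counterfactual(ind, mapping)
            pairs += [(c_d, c_ind), (c_ind, c_d)]    # (3)->(4), (4)->(3)
    return pairs


if __name__ == "__main__":
    pool = ["Marie Curie", "Tokyo", "Beethoven", "Cairo"]  # hypothetical pool
    for source, target in build_pairs(SENTENCES, "Everdigen", "Sweden", pool, random.Random(0)):
        print(f"INPUT:  {source}\nTARGET: {target}\n")
```

Each resulting pair would serve as one training instance in which the model reads the source text and is trained to generate the target text token by token. The actual prompt format and loss used in the paper are not shown in this section.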