LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning

Shenghao Li

TL;DR

The paper tackles the challenge of logical reasoning in pretrained models by proposing LFC-DA, a symbolic-logic-controlled data augmentation pipeline that maps natural language to propositional formulas, explores the logic space with a DFS-based approach, and instantiates new formulas back into natural text with large language models. This formalization-exploration-instantiation framework aims to produce diverse yet logically rigorous training data, addressing the lack of interpretability and limited variety in purely model-driven augmentation. Empirical results on ReClor and LogiQA show that LFC-DA-generated data significantly improves logical-reasoning accuracy compared with strong baselines, validating the method’s effectiveness and generalization. Overall, LFC-DA offers a scalable, explainable pathway to enhance reasoning in downstream tasks while reducing reliance on manual annotation.

Abstract

For complex logical data augmentation, heavy reliance on human annotation is costly, whereas direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, we present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas that are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Data Augmentation Pipeline: (1) the Stanza System formalizes natural language into propositional logic formulas; (2) the DFS Framework systematically explores and generates diverse novel formulas; building on this, (3) the Prompt drives the large language model to instantiate these new formulas into natural text. Ultimately, the pipeline generates high-quality data to achieve data enhancement.
  • Figure 2: This figure illustrates the conversion pipeline from natural language instances to propositional logic formulas: Stanza(process) performs syntactic parsing; Stanza(output) identifies the logical structure $\alpha \rightarrow \beta$ and defines variables ($\alpha$: rain, $\beta$: ground wet); rule(match) matches $[\alpha\rightarrow\beta, \alpha]$ with the rule base, after which rule(output) generates the final formula $(\alpha\rightarrow\beta) \wedge \alpha \Rightarrow \beta$.
  • Figure 3: DFS enumeration process: Using the abstract syntax tree of Modus Ponens as an example to illustrate the depth-first pre-order traversal strategy. The process starts from the (black) root node conjunction symbol $\wedge$, then systematically traverses the (green) left subtree (accessing its internal nodes in the order of $\rightarrow$, $\alpha$, $\beta$), and finally accesses the (black) right child node $\alpha$ under the root node.
  • Figure 4: Alternative Derivation Path: This figure illustrates an alternative derivation path from the premise $(\alpha \rightarrow \beta) \wedge \alpha$ to the conclusion $\beta$. Assuming rule ① is unavailable in the rule base, the system systematically applies fundamental theorems from the rule base, progressively deriving the target through a series of rigorous logical transformations.
  • Figure 6: This framework integrates symbolic logic with LLM capabilities to create interpretable training data. It expands initial formulas into correct (OK) and error (ERROR) state sets, annotates logical attributes via [equivalence] and [contain] markers, constructs labeled sample pairs using three logical relations (① equivalence, ② implication, ③ derivation), and employs customized prompts to instantiate symbolic samples as natural text, ensuring logical correctness, diversity, and interpretability.
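
The traversal described in the Figure 3 caption and the formula from the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Node` class, the string labels (`"and"`, `"implies"`, `"alpha"`, `"beta"`), and the helpers `preorder`, `evaluate`, and `entails` are hypothetical names chosen for this example. The sketch builds the syntax tree of $(\alpha \rightarrow \beta) \wedge \alpha$, visits it in depth-first pre-order, and uses a brute-force truth table to check that the premise entails $\beta$.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterator, List, Optional


@dataclass(frozen=True)
class Node:
    """One node of a propositional-formula syntax tree (hypothetical type)."""
    label: str                       # "and", "implies", or a variable name
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder(node: Node) -> Iterator[str]:
    """Depth-first pre-order: the node itself, then left subtree, then right."""
    yield node.label
    for child in (node.left, node.right):
        if child is not None:
            yield from preorder(child)


def evaluate(node: Node, env: Dict[str, bool]) -> bool:
    """Evaluate the formula under a truth assignment for its variables."""
    if node.label == "and":
        return evaluate(node.left, env) and evaluate(node.right, env)
    if node.label == "implies":
        return (not evaluate(node.left, env)) or evaluate(node.right, env)
    return env[node.label]           # leaf: a propositional variable


def entails(premise: Node, conclusion: Node, variables: List[str]) -> bool:
    """True if every assignment satisfying the premise satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if evaluate(premise, env) and not evaluate(conclusion, env):
            return False
    return True


alpha, beta = Node("alpha"), Node("beta")
# Premise of Modus Ponens: (alpha -> beta) AND alpha
premise = Node("and", Node("implies", alpha, beta), alpha)

order = list(preorder(premise))
valid = entails(premise, beta, ["alpha", "beta"])
print(order)   # ['and', 'implies', 'alpha', 'beta', 'alpha']
print(valid)   # True
```

The visiting order matches the caption: the root $\wedge$, then the left subtree ($\rightarrow$, $\alpha$, $\beta$), then the right child $\alpha$. The truth-table check is exponential in the number of variables, so it only suits small formulas like this one.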