Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

Yuan Yang; Siheng Xiong; Ali Payani; Ehsan Shareghi; Faramarz Fekri

Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, Faramarz Fekri

TL;DR

The study tackles NL to FOL translation by training a compact LoRA-tuned LLaMA-7B model, LogicLLaMA, to directly translate NL to FOL and to correct outputs from a larger model (GPT-3.5) using a novel SFT plus RLHF framework guided by a FOL verifier. A 34K NL-FOL dataset, MALLS, generated from GPT-4 provides silver labels for fine-tuning, enabling strong performance on gold benchmarks LogicNLI and FOLIO. Results show LogicLLaMA can surpass GPT-3.5 in direct translation and reach GPT-4 level performance on FOLIO through chain-of-thought correction, all at a fraction of the cost and with single-GPU training. The work demonstrates the viability of local, task-specific fine-tuning on outputs of larger models and offers dataset and code resources to the community.

Abstract

Translating natural language sentences to first-order logic (NL-FOL translation) is a longstanding challenge in the NLP and formal logic literature. This paper introduces LogicLLaMA, a LLaMA-7B model fine-tuned for NL-FOL translation using LoRA on a single GPU. LogicLLaMA is capable of directly translating natural language into FOL rules, which outperforms GPT-3.5. LogicLLaMA is also equipped to correct FOL rules predicted by GPT-3.5, and can achieve similar performance as GPT-4 with a fraction of the cost. This correction ability was achieved by a novel supervised fine-tuning (SFT) + reinforcement learning with human feedback (RLHF) framework, which initially trains on synthetically perturbed NL-FOL pairs to encourage chain-of-thought reasoning and then fine-tunes with RLHF on GPT-3.5 outputs using a FOL verifier as the reward model. To train LogicLLaMA, we present MALLS (large language $\textbf{M}$odel gener$\textbf{A}$ted N$\textbf{L}$-FO$\textbf{L}$ pair$\textbf{S}$), a dataset of 34K high-quality and diverse sentence-level NL-FOL pairs collected from GPT-4. The dataset was created by implementing a pipeline that prompts GPT-4 for pairs, and dynamically adjusts the prompts to ensure the collection of pairs with rich and diverse contexts at different levels of complexity, and verifies the validity of the generated FOL rules. Codes, weights, and data are available at $\href{https://github.com/gblackout/LogicLLaMA}{\small \text{https://github.com/gblackout/LogicLLaMA}}$.

Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

TL;DR

Abstract

odel gener

ted N

-FO

pair

), a dataset of 34K high-quality and diverse sentence-level NL-FOL pairs collected from GPT-4. The dataset was created by implementing a pipeline that prompts GPT-4 for pairs, and dynamically adjusts the prompts to ensure the collection of pairs with rich and diverse contexts at different levels of complexity, and verifies the validity of the generated FOL rules. Codes, weights, and data are available at

Paper Structure (22 sections, 2 equations, 13 figures, 5 tables)

This paper contains 22 sections, 2 equations, 13 figures, 5 tables.

Introduction
Related Work
Malls Dataset Creation
Prompt pipeline
Dataset statistics
Fine-tuning LogicLLaMA for NL-FOL Translation
Fine-tuning for direct translation and naive correction
Chain-of-Thought correction via SFT and RLHF
FOL Rule Perturbations and SFT
RLHF for CoT correction
FOL evaluation and reward design
Experiments
Results
Analysis
Conclusion
...and 7 more sections

Figures (13)

Figure 1: Snippet from the top 200 frequent FOL term pairs in Malls (for full version see Appendix \ref{['app:dataset_app']}). Many terms are associated with a wide range of other terms, which suggests the rules are semantically and contextually diverse.
Figure 2: Input and expected outputs for direct translation, naive correction, and CoT correction.
Figure 2: The list of all atomic perturbations.
Figure 3: Overview of the SFT and RLHF training for the Chain-of-Thought (CoT) correction mode of LogicLLaMA.
Figure 3: BLEU and the logical equivalence (LE) scores of LogicLLaMA and GPT models on LogicNLI and FOLIO. Direct translation using LogicLLaMA outperforms GPT-3.5 and CoT correction achieves a similar performance as 5-shot GPT-4.
...and 8 more figures

Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

TL;DR

Abstract

Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

Authors

TL;DR

Abstract

Table of Contents

Figures (13)