Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
Felix Vossel, Till Mossakowski, Björn Gehrke
TL;DR
This work tackles the challenging task of translating natural language into first-order logic (NL→FOL) and evaluates how fine-tuned large language models (LLMs) perform across architectures, data strategies, and curricula using the MALLS and Willow datasets. It systematically compares encoder–decoder and decoder-only models, introducing eight targeted experiments (e.g., token extensions, fixed predicate lists, predicate-noise, multi-step curricula, and multilingual training) to identify what drives translation quality, with evaluation through exact matches, semantic equivalence, and predicate matching. A key finding is that predicate availability (gold predicates) significantly boosts performance, and encoder–decoder T5 models can outperform larger decoder-only LLMs, achieving up to 70% equivalence on Willow/MALLS translations with predicate lists; the best model is Flan-T5-XXL. The study also reveals that predicate extraction remains a bottleneck, that learning predicates via two-step curriculums offers limited benefits, and that multilingual training yields only modest changes, while FOLIO-based logical-argument evaluation suggests robust generalization to reasoning tasks beyond the training distribution. Overall, the paper provides a practical framework and resources for advancing NL→FOL formalization with fine-tuned LLMs, highlighting promising directions for predicate extraction, multilingual grounding, and hybrid reasoning architectures.
Abstract
Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and even the DeepSeek-R1-0528 model with CoT reasoning ability as well as symbolic systems like ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
