Table of Contents
Fetching ...

Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

Andrea Brunello, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

TL;DR

The paper identifies inconsistencies in NL→FOL evaluation and introduces a principled benchmarking framework that decomposes NL-FOL translation into Ontology Extraction and Logical Translation. It defines three core tasks—logical translation, most similar, and ranking—within triplets $(p,\varphi,\Omega)$ where $\Omega=(\sigma,\gamma)$ encodes a fixed FOL signature and symbol glossary. Through experiments on two datasets, D_Stanford and D_FOLIO, with six models (four dialogue-oriented and two embedding-centric), the study shows that dialogue-oriented LLMs (e.g., GPT-4o-mini, o3-mini) achieve strong NL-FOL translation and genuine sentence-level logical understanding, while embedding-centric models lag behind. It also demonstrates that standard metrics like BLEU and LE poorly correlate with true logical equivalence, advocating for SMT-based verification and the proposed LT/MS/RK tasks as robust evaluation paradigms with practical implications for formal verification and autoformalization research.

Abstract

Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs' actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

TL;DR

The paper identifies inconsistencies in NL→FOL evaluation and introduces a principled benchmarking framework that decomposes NL-FOL translation into Ontology Extraction and Logical Translation. It defines three core tasks—logical translation, most similar, and ranking—within triplets where encodes a fixed FOL signature and symbol glossary. Through experiments on two datasets, D_Stanford and D_FOLIO, with six models (four dialogue-oriented and two embedding-centric), the study shows that dialogue-oriented LLMs (e.g., GPT-4o-mini, o3-mini) achieve strong NL-FOL translation and genuine sentence-level logical understanding, while embedding-centric models lag behind. It also demonstrates that standard metrics like BLEU and LE poorly correlate with true logical equivalence, advocating for SMT-based verification and the proposed LT/MS/RK tasks as robust evaluation paradigms with practical implications for formal verification and autoformalization research.

Abstract

Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs' actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

Paper Structure

This paper contains 48 sections, 2 equations, 1 figure, 9 tables.

Figures (1)

  • Figure 1: Example of instantiation of phrases, signatures, and formulas as used in the benchmarking strategy tasks.

Theorems & Definitions (4)

  • Definition 1: First-order signature
  • Definition 2: Syntax of FOL
  • Definition 3: $\sigma$-structure
  • Definition 4