Table of Contents
Fetching ...

Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?

Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu

TL;DR

The paper tackles symbol drift in LLM-based logical translation, showing that linguistic variation can break downstream deduction. It introduces SoLT, a diversification benchmark built with Logic-invariant Linguistic Diversification (LiLD) to generate linguistically diverse yet logically equivalent problems, and MenTaL, a Mental Representation Table approach that links equivalent expressions to shared symbols during translation. empirical results across six tasks reveal substantial accuracy drops and symbol drift under SoLT, while MenTaL — via in-context guidance or supervised fine-tuning — significantly stabilizes symbol mappings and improves performance (up to ~$40\%$ accuracy gains) on diversified inputs. The work provides a practical benchmark and a generalizable framework to improve robust logical translation, with implications for more reliable neuro-symbolic reasoning in real-world multilingual and varied-text settings; code is available at https://github.com/wufeiwuwoshihua/LinguDiver.

Abstract

Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective in many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, which frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms across multiple levels. Beyond evaluation, SoLT also provides a general method to enrich any dataset with linguistic diversity while preserving both meaning and logic. To further enhance the stability of LLM-based reasoning, we propose MenTaL, which explicitly guides models to build a concept-symbol mapping table during translation. By linking equivalent expressions to shared symbols, MenTaL maintains consistency and mitigates symbol drift. Experiments on SoLT demonstrate that LLMs indeed suffer from inconsistent symbol mapping under linguistic variation, leading to significant drops in reasoning accuracy. Meanwhile, applying MenTaL brings clear and stable performance improvements across diverse inputs. Overall, our findings reveal that overlooking linguistic diversity hides key weaknesses in LLM-based translators, and our work offers a step toward more reliable logical reasoning in varied real-world scenarios. Our code is available at https://github.com/wufeiwuwoshihua/LinguDiver.

Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?

TL;DR

The paper tackles symbol drift in LLM-based logical translation, showing that linguistic variation can break downstream deduction. It introduces SoLT, a diversification benchmark built with Logic-invariant Linguistic Diversification (LiLD) to generate linguistically diverse yet logically equivalent problems, and MenTaL, a Mental Representation Table approach that links equivalent expressions to shared symbols during translation. empirical results across six tasks reveal substantial accuracy drops and symbol drift under SoLT, while MenTaL — via in-context guidance or supervised fine-tuning — significantly stabilizes symbol mappings and improves performance (up to ~ accuracy gains) on diversified inputs. The work provides a practical benchmark and a generalizable framework to improve robust logical translation, with implications for more reliable neuro-symbolic reasoning in real-world multilingual and varied-text settings; code is available at https://github.com/wufeiwuwoshihua/LinguDiver.

Abstract

Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective in many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, which frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms across multiple levels. Beyond evaluation, SoLT also provides a general method to enrich any dataset with linguistic diversity while preserving both meaning and logic. To further enhance the stability of LLM-based reasoning, we propose MenTaL, which explicitly guides models to build a concept-symbol mapping table during translation. By linking equivalent expressions to shared symbols, MenTaL maintains consistency and mitigates symbol drift. Experiments on SoLT demonstrate that LLMs indeed suffer from inconsistent symbol mapping under linguistic variation, leading to significant drops in reasoning accuracy. Meanwhile, applying MenTaL brings clear and stable performance improvements across diverse inputs. Overall, our findings reveal that overlooking linguistic diversity hides key weaknesses in LLM-based translators, and our work offers a step toward more reliable logical reasoning in varied real-world scenarios. Our code is available at https://github.com/wufeiwuwoshihua/LinguDiver.

Paper Structure

This paper contains 47 sections, 9 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: An illustrative case of LLM translators failing under linguistic diversification. LLMs map equivalent expressions to different symbols, breaking reasoning chains and preventing the solver from reaching the correct conclusion.
  • Figure 2: Accuracy drop of GPT-4 across six reasoning tasks under four diversification types: third-person reference (Thir), synonym substitution (Syno), part-of-speech shift (Part), and syntactic transformation (Synt).
  • Figure 3: SoLT performs multi-level linguistic diversification while preserving semantics and logic to evaluate translation stability. MenTaL enforces unified symbol mapping through a Mental Representation Table.
  • Figure 4: Accuracy$\uparrow$ and SDS$\downarrow$ on six reasoning tasks across four LLMs under three evaluation settings. PW$_1$ and PW$_2$ denote the ProofWriter dataset paired with the Prover9 and PyKE solvers, respectively. Origin refers to the original dataset with direct prompting, SoLT refers to the SoLT-diversified dataset with direct prompting, and Tuning refers to the SoLT-diversified dataset with prompt-tuning. Accuracy is shown in the upper bars, while SDS is shown in the lower bars.
  • Figure 5: Error statistics for GPT-4 across six tasks.
  • ...and 2 more figures