Table of Contents
Fetching ...

Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao

TL;DR

The paper investigates whether LLMs truly excel at complex logical reasoning when guided by formal languages. It proposes a three-dimensional evaluation framework spanning model spectrum, task taxonomy, and trajectory formats and assesses 66 datasets across Text, PoT, Z3, and CSP in zero-shot settings. Key findings show that Thinking models outperform Instruct models, inductive reasoning remains a weakness, and PoT formats generalize well, while formal languages degrade on harder tasks; a simple rejected-fine-tuning data augmentation significantly boosts performance for small models. The results inform the design of future reasoning systems and dataset curation for formal-language tasks, offering practical strategies to bridge the gap between natural language reasoning and formal-symbolic computation.

Abstract

Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.

Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

TL;DR

The paper investigates whether LLMs truly excel at complex logical reasoning when guided by formal languages. It proposes a three-dimensional evaluation framework spanning model spectrum, task taxonomy, and trajectory formats and assesses 66 datasets across Text, PoT, Z3, and CSP in zero-shot settings. Key findings show that Thinking models outperform Instruct models, inductive reasoning remains a weakness, and PoT formats generalize well, while formal languages degrade on harder tasks; a simple rejected-fine-tuning data augmentation significantly boosts performance for small models. The results inform the design of future reasoning systems and dataset curation for formal-language tasks, offering practical strategies to bridge the gap between natural language reasoning and formal-symbolic computation.

Abstract

Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.

Paper Structure

This paper contains 42 sections, 5 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Evaluation framework with three specific dimensions: spectrum of LLMs, taxonomy of logical reasoning tasks, and format of trajectories.
  • Figure 2: Radar plots illustrating the performance (%) of multiple LLMs across different reasoning task types (Deductive, Inductive, Abductive, Mixed Form) and trajectory formats (Text, PoT, Z3, CSP). Overall (top 1 × 4) shows aggregated performance by reasoning type and format. Fine-grained (below 4 × 4) present fine-grained results on individual tasks
  • Figure 3: Preferred reasoning task performance across different trajectory formats (Text, PoT, Z3, CSP) in GPT-4o results. Each subplot shows task accuracy under different formats, with execution rate (Exec Rate) plotted as a black line. The highlighted bars represent the most preferred trajectory format for each task.
  • Figure 4: Generalization performance across fine-grained (task type × format) configurations. Each cell shows the performance gain ($\Delta$) from training on the row configuration and evaluating on the column configuration
  • Figure 5: Generalization performance across reasoning types and trajectory formats (coarse-grained analysis). Each cell reports the performance gain ($\Delta$) when training on the row group and evaluating on the column group.
  • ...and 9 more figures