LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

Yanrui Wu; Lingling Zhang; Xinyu Zhang; Jiayu Chang; Pengyu Li; Xu Jiang; Jingtao Hu; Jun Liu

TL;DR

LogicGraph is the first benchmark designed to systematically evaluate multi-path logical reasoning. It is constructed via a neuro-symbolic framework that combines backward logic generation with semantic instantiation, and it is paired with a reference-free evaluation framework that rigorously assesses model performance in both convergent and divergent regimes.

Abstract

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark designed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver-verified reasoning problems characterized by high-depth multi-path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.

Paper Structure (59 sections, 7 equations, 17 figures, 7 tables)

Introduction
Related Work
Logical Reasoning Datasets for LLMs
Symbolic Prover Augmented LLMs
Task Formulation
Automatic Dataset Generation Pipeline
Symbolic Logic DAG Generation
Graph Primitives: Inference Node & Family.
Bottom-Up Construction.
Multi-path Generation.
Semantic Instantiation
Solver-based Filtering
Dataset Characteristics
Neuro-Symbolic Evaluation Framework
The Evaluation Pipeline
...and 44 more sections

Figures (17)

Figure 1: Illustration of the multi-path reasoning challenge. In the real world, the same conclusion may be entailed via multiple derivation paths.
Figure 2: LogicGraph generation pipeline. (a) Logic DAG generation (see Symbolic Logic DAG Generation) builds a goal-to-premise directed acyclic graph by sampling argument forms, yielding multiple valid reasoning paths; paths that share intermediate inference nodes are grouped into families. (b) Semantic instantiation (see Semantic Instantiation) translates the DAG into Prover9 formulas and renders the corresponding steps into natural language by instantiating abstract entities and scenario context with LLMs. (c) Solver-based filtering (see Solver-based Filtering) uses Prover9 to validate each instance.
Figure 3: Three-stage evaluation pipeline for LLM-generated multi-path proofs: (1) pre-processing and auto-formalization extract candidate solutions, resolve references, and translate natural-language steps into a symbolic representation (Prover9-style); (2) a solver is then used to validate local (stepwise) and global validity; (3) failures are annotated along two independent, non-exclusive error axes: semantic comprehension and logical execution.
Figure 4: Performance dynamics of LLMs across varying reasoning depths. Solid lines represent Reasoning-oriented LLMs; dashed lines represent General-purpose LLMs.
Figure 5: Comparative Analysis of Error Type: Semantic Comprehension vs. Logical Execution.
...and 12 more figures
