Are Language Models Efficient Reasoners? A Perspective from Logic Programming
Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf
TL;DR
We propose a formal, logic-programming–driven lens to evaluate LM reasoning efficiency by treating proofs as shortest $(S, v)$-hyperpaths in a proof hypergraph and by mapping LM-generated natural-language proofs to verbalized logic-program proofs via a disjoint verbalizer. The study constructs GSM word problems with multiple irrelevant axiom sets and analyzes how lexical overlap and the number of distractors affect accuracy and the prevalence of irrelevant intermediate theorems. Across models and settings, accuracy declines when irrelevant content is introduced, and LM proofs frequently detour through unnecessary steps, with efficiency quantified by the ratio of the length of the LM-produced proof to the shortest possible proof. The results demonstrate that token-based measures miss critical aspects of reasoning quality and highlight the value of a formal, proof-centric framework to diagnose and improve LM reasoning efficiency in structured tasks like math word problems. Overall, the paper provides a principled approach to assess and guide efficient deductive reasoning in LMs, with practical implications for designing prompts, architectures, and training objectives that encourage minimal, goal-directed inference. $| ext{Proof}^*|/| ext{Proof}|$ and the shortest $(S, v)$-hyperpath formalism offer concrete metrics and tools for future research on efficient reasoning in language models.
Abstract
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
