Inductive Learning of Logical Theories with LLMs: An Expressivity-Graded Analysis

João Pedro Gandarela, Danilo S. Carvalho, André Freitas

TL;DR

This work presents a graded methodology for evaluating how well large language models can induce logical theories by coupling LLMs with a formal inference engine (Prolog) and a synthetic data generator that varies rule expressivity and noise. Through iterative theory generation and evaluation, the approach benchmarks against a state-of-the-art ILP system across expressivity categories CHAIN, RDG, DRDG, and MIXED. Findings show that larger LLMs can match or approach ILP performance at higher noise but struggle with long predicate chains and exhibit non-monotonic improvements with more iterations; model size alone is not a reliable predictor of robustness. The proposed framework yields a reusable, graded pipeline for assessing inductive capabilities of LLMs with formal grounding, aiding interpretability and systematic comparison across models and tasks.

Abstract

This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs) on logic theory induction, using feedback from a formal inference engine. The analysis is complexity-graded with respect to rule dependency structure, allowing the effect of specific inference challenges on LLM performance to be quantified. Integrating LLMs with formal methods is a promising frontier in Natural Language Processing and an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While these models can be complemented by formal systems, the properties LLMs deliver for inductive learning are not well understood or quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a state-of-the-art Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle for LLMs than theory complexity.

Paper Structure

This paper contains 34 sections, 7 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The proposed method to evaluate theory induction with an LLM in Prolog based on background knowledge and training examples. The process starts with a prompt generator $(c)$ that formulates prompts for an LLM $(a)$. Both the background knowledge and training sets are parameterised by different noise and rule expressive power levels: Chain, Rooted Directed Graph (DG), Disjunctive Rooted DG, and Mixed. The LLM generates theories, which are then evaluated by a logic program interpreter $(b)$. The evaluation feedback, including accuracy, precision, recall, and F1 scores, as well as wrongly classified examples, is used to refine the prompts iteratively. We analyse and categorise the generated theories according to their expressive power $(d)$.
  • Figure 2: F1 score trends across categories. Different models (GPT-4o, Llama3 8b instruct, Popper, and Mixtral-8x7B-Instruct-v0.1) under varying noise levels and categories reveal distinct performance patterns. GPT-4o demonstrates stable accuracy yet is sensitive to noise, particularly in complex rule-based categories like RDG and DRDG. Mixtral-8x7B-Instruct-v0.1 exhibits mixed results, with notable variability across categories, particularly in more complex tasks. Llama3 8b instruct delivers lower scores, indicating challenges in reasoning and theory generation.
  • Figure 3: Time consumption trends across categories, on a logarithmic scale. The data consistently show that the LLMs take less time than Popper in all intervals. These results do not, however, represent a measure of efficiency, as the computational resources employed differ vastly across methods.
  • Figure 4: Relationship between the F1 score and the logarithm of processing time (in seconds) for five different mixed models (Popper, GPT-4, GPT-3.5-turbo, Llama3, and Mixtral) across three noise levels (0.1, 0.2, and 0.3) and each rule set category: CHAIN ($\bullet$), CHAIN R. ($\square$), RDG ($\pentagon$), RDG R. (+), DRDG ($\bigstar$), DRDG R. ($\diamondsuit$), and MIXED (X). Each subplot corresponds to a different noise level, showing how each model's performance and processing time vary with increasing noise. While Popper always takes more time to develop a theory, the other two levels (0.5 - 1.0, 1.0 - 2.5) correspond to different execution environments. Time variance changes in opposite ways w.r.t. noise on Popper vs. the LLMs.
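
The iterative loop described in the Figure 1 caption (generate a theory, evaluate it, feed the scores and misclassified examples back into the next prompt) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `score`, `refine`, `propose`, and `entails` are hypothetical, the LLM and the Prolog interpreter are replaced by plain Python callables, and keeping the best-scoring theory across iterations is an assumption made here because the summary reports non-monotonic improvement with more iterations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: an example is (ground atom as text, label).
Example = Tuple[str, bool]


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score(predictions: List[bool], labels: List[bool]) -> Metrics:
    """Compute accuracy, precision, recall and F1 from boolean predictions."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    tn = sum(not p and not l for p, l in zip(predictions, labels))
    total = len(labels)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Metrics(accuracy, precision, recall, f1)


def refine(
    propose: Callable[[str], object],        # stand-in for the LLM call
    entails: Callable[[object, str], bool],  # stand-in for the interpreter
    examples: List[Example],
    max_iters: int = 3,
) -> Optional[Tuple[object, Metrics]]:
    """Run the generate/evaluate/feedback loop; return the best (theory, metrics)."""
    labels = [label for _, label in examples]
    feedback = ""
    best: Optional[Tuple[object, Metrics]] = None
    for _ in range(max_iters):
        theory = propose(feedback)
        preds = [entails(theory, atom) for atom, _ in examples]
        metrics = score(preds, labels)
        # Assumption: keep the best theory seen so far, since later
        # iterations are not guaranteed to improve on earlier ones.
        if best is None or metrics.f1 > best[1].f1:
            best = (theory, metrics)
        if metrics.f1 == 1.0:
            break
        wrong = [atom for (atom, label), p in zip(examples, preds) if p != label]
        feedback = (
            f"accuracy={metrics.accuracy:.2f} precision={metrics.precision:.2f} "
            f"recall={metrics.recall:.2f} f1={metrics.f1:.2f} "
            f"misclassified={wrong}"
        )
    return best
```

In a real setup, `propose` would wrap a prompt generator and an LLM call, and `entails` would query a Prolog interpreter with the generated theory and the background knowledge; both are left abstract here so the sketch runs without external dependencies.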