Table of Contents
Fetching ...

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

Serafim Batzoglou

TL;DR

INDUCTION, a benchmark for finite structure concept synthesis in first order logic, is introduced, finding sharp difficulty gradients, persistent hard structural families, and observing that low bloat formulas generalize far better on held out worlds.

Abstract

We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first order logical formula that explains the target uniformly across worlds, with correctness verified via exact model checking. The benchmark includes three regimes, FullObs, CI (contrastive), and EC (existential completion), nd penalizes formula bloat. We find sharp difficulty gradients, persistent hard structural families, and observe that low bloat formulas generalize far better on held out worlds. Elite recent models show qualitatively different behaviors across tasks and performance metrics, hinting to their different strategies of concept generalization.

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

TL;DR

INDUCTION, a benchmark for finite structure concept synthesis in first order logic, is introduced, finding sharp difficulty gradients, persistent hard structural families, and observing that low bloat formulas generalize far better on held out worlds.

Abstract

We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first order logical formula that explains the target uniformly across worlds, with correctness verified via exact model checking. The benchmark includes three regimes, FullObs, CI (contrastive), and EC (existential completion), nd penalizes formula bloat. We find sharp difficulty gradients, persistent hard structural families, and observe that low bloat formulas generalize far better on held out worlds. Elite recent models show qualitatively different behaviors across tasks and performance metrics, hinting to their different strategies of concept generalization.
Paper Structure (58 sections, 5 equations, 5 figures, 27 tables)

This paper contains 58 sections, 5 equations, 5 figures, 27 tables.

Figures (5)

  • Figure 1: FullObs v1 budgeted accuracy curves: Acc@(+$\Delta$) for $\Delta \in [0, 100]$.
  • Figure 2: FullObs holdout generalization by AST delta bin. For train-correct predictions, we measure exact-match rate on held-out worlds. Generalization degrades monotonically with formula complexity across all models.
  • Figure 3: CI v1 failure mode distribution by model (stacked bar chart).
  • Figure 4: EC v1 budgeted validity curves: Acc@(+$\Delta$) for $\Delta \in [0, 100]$. Y-axis shows the fraction of instances with valid formulas having AST $\leq$ gold$+\Delta$.
  • Figure 5: CI holdout generalization by AST delta bin. Same methodology as FullObs; bloated formulas generalize worse to new YES worlds.