Greedy Grammar Induction with Indirect Negative Evidence

Joseph Potashnik

TL;DR

This work reframes CFG induction through indirect negative evidence, using a fitness function $f$ that compares the observed basic strings with those a candidate grammar can generate up to the pumping lemma bound $p$. The method defines a landscape with optimal substructure over a lattice of production-rule sets and applies a greedy, BFS-like branch-and-bound learner that follows only paths along which $f$ does not increase. Learnability is analyzed by introducing $(m,k)$-incremental grammar classes and showing that, under certain conditions, the learner reaches a grammar weakly equivalent to the target, with efficiency hinging on the growth of the optimal solution set $\mathcal{O}(D,t)$. The approach, demonstrated on POS-tagged nonlexicalized CFGs, suggests that many natural grammars are efficiently learnable incrementally and highlights directions for parameter estimation and ranking of candidate grammars. The work blends pumping-lemma theory, Earley parsing, and dynamic-programming grammar counts to enable a tractable, evidence-guided search in CFG space.
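
The search described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: this summary gives neither the exact definition of $f$ nor the learner's details, so the sketch assumes that $f$ is the size of the symmetric difference between the observed strings and the strings a candidate rule set generates up to length $p$, and that grammars have no epsilon rules. The names `generate_up_to`, `fitness`, and `greedy_search`, and the toy POS-tag grammar, are all hypothetical.

```python
from collections import deque


def generate_up_to(rules, nonterminals, start, p):
    """Return all terminal strings of length <= p derivable from `start`.

    `rules` is a set of (lhs, rhs) pairs, with rhs a tuple of symbols.
    Assumption: no epsilon rules, so a sentential form longer than p
    can never shrink back to length <= p and is safely pruned.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    strings = set()
    while queue:
        form = queue.popleft()
        # Expand the leftmost nonterminal only (leftmost derivations suffice).
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:
            strings.add(form)  # all symbols are terminals
            continue
        for lhs, rhs in rules:
            if lhs != form[idx]:
                continue
            new = form[:idx] + rhs + form[idx + 1:]
            if len(new) <= p and new not in seen:
                seen.add(new)
                queue.append(new)
    return strings


def fitness(rules, nonterminals, start, observed, p):
    """Assumed fitness: observed-but-not-generated plus generated-but-unobserved."""
    return len(observed ^ generate_up_to(rules, nonterminals, start, p))


def greedy_search(candidates, nonterminals, start, observed, p):
    """BFS over rule sets, adding one rule per step and pruning any child
    whose fitness is worse than its parent's (non-increasing f only)."""
    root = frozenset()
    queue = deque([root])
    visited = {root}
    while queue:
        rules = queue.popleft()
        f_parent = fitness(rules, nonterminals, start, observed, p)
        if f_parent == 0:
            return rules
        for rule in candidates:
            child = rules | {rule}
            if child in visited:
                continue
            visited.add(child)
            if fitness(child, nonterminals, start, observed, p) <= f_parent:
                queue.append(child)
    return None


# Toy example over POS tags d (determiner), n (noun), v (verb).
r0 = ("S", ("NP", "v"))
r1 = ("NP", ("d", "n"))
r2 = ("S", ("n", "v"))  # overgenerates relative to the observed text
observed = {("d", "n", "v")}
learned = greedy_search([r0, r1, r2], {"S", "NP"}, "S", observed, p=3)
```

In this toy run, adding `r2` alone makes the grammar generate an unobserved string, so that branch is pruned. The non-occurrence of a string the candidate grammar predicts is what plays the role of indirect negative evidence here.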

Abstract

This paper offers a fresh look at the pumping lemma constant as an upper bound on the information required for learning context-free grammars. An objective function based on indirect negative evidence considers the occurrences, and non-occurrences, of a finite number of strings encountered after a sufficiently long presentation. This function has optimal substructure in the hypothesis space, giving rise to a greedy branch-and-bound search learner. A hierarchy of learnable classes is defined in terms of the number of production rules that must be added to interim solutions in order to fit the input incrementally. Efficiency depends strongly on the position of the target grammar in the hierarchy and on the richness of the input.

Paper Structure

This paper contains 16 sections, 2 theorems, 15 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Theorem 2.1

Any target CFG $G$ is identifiable in the limit given a fair basic text generated by it.

Figures (2)

  • Figure 1: The Pumping Lemma
  • Figure 2: Left/Right recursion of Adjectives under an NP constituent

Theorems & Definitions (4)

  • Definition 2.1: a fair basic text
  • Theorem 2.1: Learnability of CFGs
  • Theorem 4.1: Behaviour of $f$ in fair basic texts
  • Definition 7.1: $(m,k)$-incremental classes