Table of Contents
Fetching ...

A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

Hayate Funakura, Hyunsoo Kim, Koji Mineshima

TL;DR

Graph-based metrics such as Smatch capture surface overlap but not logical equivalence; the study evaluates logical adequacy by bidirectional entailment between gold and predicted formulas using automated theorem proving, comparing supervised fine-tuning (T5-Small/Base) and in-context learning (GPT-4o/4.1/5) across raw and prenex-normalized targets. Ideally $SR_p(S)$ is logically equivalent to $SR_g(S)$, and normalization improves well-formedness and logical adequacy, while high graph-overlap does not guarantee equivalence. Results show normalization reduces incidental variability and strengthens logical alignment, whereas complexity and certain syntactic phenomena (coordination, prepositional phrases, passive voice) degrade prover accuracy; errors cluster around variable binding and predicate naming. The findings motivate logic-sensitive evaluation and training objectives, and suggest a practical hybrid pipeline that combines generation with logic-aware verification; all code and data are publicly available.

Abstract

Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.

A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

TL;DR

Graph-based metrics such as Smatch capture surface overlap but not logical equivalence; the study evaluates logical adequacy by bidirectional entailment between gold and predicted formulas using automated theorem proving, comparing supervised fine-tuning (T5-Small/Base) and in-context learning (GPT-4o/4.1/5) across raw and prenex-normalized targets. Ideally is logically equivalent to , and normalization improves well-formedness and logical adequacy, while high graph-overlap does not guarantee equivalence. Results show normalization reduces incidental variability and strengthens logical alignment, whereas complexity and certain syntactic phenomena (coordination, prepositional phrases, passive voice) degrade prover accuracy; errors cluster around variable binding and predicate naming. The findings motivate logic-sensitive evaluation and training objectives, and suggest a practical hybrid pipeline that combines generation with logic-aware verification; all code and data are publicly available.

Abstract

Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.

Paper Structure

This paper contains 29 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Relationship between target-side formula complexity and model performance. (a) best SFT configuration (T5-Base, seed 5); (b) best ICL configuration (GPT-5, seed 1).
  • Figure 2: Prover accuracy stratified by the presence or absence of syntactic features. X-axis labels follow Section \ref{['subsec:annotation']}. (a) best SFT configuration (T5-Base, seed 5); (b) best ICL configuration (GPT-5, seed 1).
  • Figure 3: Error type distribution of 771 prover-failed predictions by T5-Base.