BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing

Subhro Roy; Sam Thomson; Tongfei Chen; Richard Shin; Adam Pauls; Jason Eisner; Benjamin Van Durme

TL;DR

BenchCLAMP is a unified benchmark for evaluating language models on semantic and syntactic parsing with grammar-constrained decoding. It covers nine datasets (seven semantic parsing, two syntactic parsing) and seven formalisms. The results show that encoder-decoder models whose output is constrained to be valid can match or exceed state-of-the-art parsing methods, especially in low-resource settings; the paper also reports practical findings on context usage and prompting. The work highlights the importance of grammar coverage and constrained decoding for reliable generation, and releases its grammars and evaluation pipelines publicly to support parsing-focused LM research. Acknowledged limitations include language scope and variance in API results; proposed future directions are broader grammar coverage and multilingual evaluation.

Abstract

Recent work has shown that generation from a prompted or fine-tuned language model can perform well at semantic parsing when the output is constrained to be a valid semantic representation. We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing, that includes context-free grammars for seven semantic parsing datasets and two syntactic parsing datasets with varied output representations, as well as a constrained decoding interface to generate only valid outputs covered by these grammars. We provide low, medium, and high resource splits for each dataset, allowing accurate comparison of various language models under different data regimes. Our benchmark supports evaluation of language models using prompt-based learning as well as fine-tuning. We benchmark eight language models, including two GPT-3 variants available only through an API. Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
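To make the idea of grammar-constrained decoding concrete, here is a minimal sketch in Python. It is not the BenchCLAMP interface: the toy grammar (`expr -> NUM | "(" OP expr expr ")"`), the function names, and the `toy_scores` stand-in for a language model are all assumptions made for illustration. At each step the decoder asks the grammar which tokens can legally come next and picks the highest-scoring token from that set only.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy grammar (an assumption, not one of the benchmark's grammars):
#   expr -> NUM | "(" OP expr expr ")"
EOS = "<eos>"
NUMBERS = {"1", "2"}
OPERATORS = {"+", "*"}


def allowed_next(stack: List[str]) -> Set[str]:
    """Return the tokens that keep the output a valid prefix of the grammar."""
    if not stack:
        return {EOS}  # the expression is complete, so only end-of-sequence fits
    top = stack[-1]
    if top == "expr":
        return {"("} | NUMBERS
    if top == "OP":
        return set(OPERATORS)
    return {top}  # a pending terminal such as ")"


def advance(stack: List[str], token: str) -> List[str]:
    """Consume one allowed token and return the updated parser stack."""
    rest = stack[:-1]
    if stack[-1] == "expr" and token == "(":
        # Expand expr -> "(" OP expr expr ")"; the next expected symbol goes last.
        return rest + [")", "expr", "expr", "OP"]
    return rest  # a number, operator, or ")" simply matches the top symbol


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]], max_len: int = 20
) -> List[str]:
    """Greedy decoding restricted to grammar-valid tokens.

    If max_len is reached first, the returned prefix may be incomplete.
    """
    stack = ["expr"]
    output: List[str] = []
    for _ in range(max_len):
        allowed = allowed_next(stack)
        scores = score_fn(output)
        token = max(allowed, key=lambda t: scores[t])
        if token == EOS:
            break
        output.append(token)
        stack = advance(stack, token)
    return output


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for model scores; it deliberately prefers the often-invalid ")"."""
    scores = {")": 5.0, "+": 4.0, "1": 3.0, "2": 2.0, "*": 1.0, "(": 0.5, EOS: 0.0}
    if not prefix:
        scores["("] = 6.0
    return scores


print(constrained_greedy_decode(toy_scores))
```

Without the constraint, greedy decoding with these scores would emit `(` followed directly by `)`, which the grammar does not accept. With the constraint, every emitted token is drawn from the grammar's allowed set, so the output is always a valid prefix and, once decoding reaches end-of-sequence, a complete expression.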
