We can still parse using syntactic rules
Ghaly Hussein
TL;DR
The paper revisits explicit syntactic parsing in the era of transformers by proposing a GPSG/HPSG-inspired parsing framework that yields both dependency and constituency structures while handling noise and incomplete input. It presents an incremental, rule-based parser with tokenization, probabilistic POS tagging, phrase creation, indexing and scanning, phrase projection, and a Dijkstra-like connecting and reranking component; the resulting parses are exported as both dependency and constituency outputs. Evaluation on UD English treebanks shows average UAS of about 54.5% on the development sets and 53.8% on the test sets, figures obtained with the larger rule set, alongside qualitative demonstrations of dual-structure outputs and slash-feature handling. The work outlines a path to new languages via language-specific POS inventories and rules, and contributes to explainable NLP by integrating longstanding syntactic theories with computational parsing, although it is currently English-only and reliant on handcrafted rules.
Abstract
This research introduces a new parsing approach based on earlier syntactic work on context-free grammar (CFG) and generalized phrase structure grammar (GPSG). The approach comprises a new parsing algorithm and a set of syntactic rules and features that overcome the limitations of CFG. It generates both dependency and constituency parse trees while accommodating noise and incomplete parses. The system was tested on data from Universal Dependencies, showing a promising average Unlabeled Attachment Score (UAS) of 54.5% on the development data (7 corpora) and 53.8% on the test data (12 corpora). The system also provides multiple parse hypotheses, allowing further reranking to improve parsing accuracy. The approach brings much of the theoretical syntactic work done since the 1950s into a computational context, and its application provides a transparent and interpretable NLP model for processing language input.
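For readers unfamiliar with the metric, the following is a minimal sketch of how an Unlabeled Attachment Score and an average over corpora can be computed. It is not the paper's evaluation code. The function names `uas` and `macro_average`, the toy sentence, and the choice of an unweighted mean over corpora are assumptions made for illustration; the abstract does not say whether the reported averages are weighted by corpus size or how punctuation is treated.

```python
from typing import Sequence


def uas(gold_heads: Sequence[int], pred_heads: Sequence[int]) -> float:
    """Return the fraction of tokens whose predicted head matches the gold head.

    Heads are 1-based token indices, with 0 standing for the root
    (the convention used in CoNLL-U files).
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted head lists must have the same length")
    if not gold_heads:
        raise ValueError("cannot score an empty token sequence")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


def macro_average(scores: Sequence[float]) -> float:
    """Unweighted mean of per-corpus scores (an assumed averaging scheme)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Toy example for "She reads books": "reads" is the root,
# and both "She" and "books" attach to it in the gold tree.
gold = [2, 0, 2]
pred = [2, 0, 1]  # hypothetical parser output with one wrong attachment
score = uas(gold, pred)  # two of three heads are correct
```

Because only head indices are compared, the score ignores dependency labels, which is what distinguishes UAS from the labeled variant (LAS).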
