Revisiting Supertagging for Faster HPSG Parsing

Olga Zamaraeva, Carlos Gómez-Rodríguez

TL;DR

This work develops and evaluates three English supertaggers (SVM, neural CRF, and fine-tuned BERT) trained on English Resource Grammar (ERG) treebanks to accelerate HPSG parsing. Pruning the lexical chart of the ACE HPSG parser with high-accuracy supertags yields a parsing speedup of roughly 3× over parsing with no tagging, along with improved parsing accuracy, on diverse test sets that go beyond WSJ23 and include out-of-domain text. The study situates itself relative to prior supertagging work, shows that the BERT-based tagger is the most accurate on multiple domains, and analyzes the tradeoff between speed and parsing accuracy, including the impact of exception lists. It also contributes the ERG-derived datasets reformatted for Huggingface token classification, and stresses the importance of dataset diversity for robust evaluation and for future work on production-ready integration.

Abstract

We present new supertaggers trained on English grammar-based treebanks and test the effects of the best tagger on parsing speed and accuracy. The treebanks are produced automatically by large manually built grammars and feature high-quality annotation based on a well-developed linguistic theory (HPSG). The English Resource Grammar treebanks include diverse and challenging test datasets, beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging has previously relied on MaxEnt-based models. We use SVM and neural CRF- and BERT-based methods and show that both SVM and neural supertaggers achieve considerably higher accuracy compared to the baseline and lead to an increase not only in the parsing speed but also the parser accuracy with respect to gold dependency structures. Our fine-tuned BERT-based tagger achieves 97.26% accuracy on 950 sentences from WSJ23 and 93.88% on the out-of-domain technical essay The Cathedral and the Bazaar (cb). We present experiments with integrating the best supertagger into an HPSG parser and observe a speedup of a factor of 3 with respect to the system which uses no tagging at all, as well as large recall gains and an overall precision gain. We also compare our system to an existing integrated tagger and show that although the well-integrated tagger remains the fastest, our experimental system can be more accurate. Finally, we hope that the diverse and difficult datasets we used for evaluation will gain more popularity in the field: we show that results can differ depending on the dataset, even if it is an in-domain one. We contribute the complete datasets reformatted for Huggingface token classification.
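
To make the idea of pruning the lexical chart with supertags more concrete, here is a minimal Python sketch. It is not the ACE integration described in the paper: the function `prune_lexical_chart`, the tag names (`det`, `noun`, `verb_3sg`, ...), the fallback when the predicted tag is not licensed by the grammar, and the token-based form of the exception list are all assumptions made for illustration only.

```python
from typing import AbstractSet, List, Set

# Hypothetical lexical chart: for each token position, the set of lexical
# types the grammar licenses. Tag names are illustrative, not ERG types.
LexicalChart = List[Set[str]]


def prune_lexical_chart(
    chart: LexicalChart,
    predicted: List[str],
    tokens: List[str],
    exceptions: AbstractSet[str] = frozenset(),
) -> LexicalChart:
    """Keep only the tagger's predicted supertag for each token.

    Assumption: a token on the exception list, or a predicted tag that the
    grammar does not license for that token, leaves the candidates unpruned.
    """
    pruned: LexicalChart = []
    for candidates, tag, token in zip(chart, predicted, tokens):
        if token.lower() in exceptions or tag not in candidates:
            pruned.append(set(candidates))  # no pruning for this token
        else:
            pruned.append({tag})  # discard all competing lexical entries
    return pruned


tokens = ["The", "dog", "barks"]
chart = [
    {"det"},
    {"noun", "verb"},
    {"noun_pl", "verb_3sg"},
]
predicted = ["det", "noun", "verb_3sg"]

pruned = prune_lexical_chart(chart, predicted, tokens)
entries_before = sum(len(c) for c in chart)   # 5 lexical entries
entries_after = sum(len(c) for c in pruned)   # 3 lexical entries

# With "barks" on the (hypothetical) exception list, its entries are kept.
pruned_with_exception = prune_lexical_chart(
    chart, predicted, tokens, exceptions={"barks"}
)
```

In this toy case the plural-noun entry for "barks" is dropped before parsing starts, so the parser never builds the noun phrase fragment reading. Fewer lexical entries mean fewer edges in the chart, which is where a speedup would come from; a wrong tag, on the other hand, can remove the entry the correct analysis needs, which is why tagger accuracy and any exception handling matter for parsing accuracy.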

Paper Structure

This paper contains 27 sections, 3 figures, 15 tables.

Figures (3)

  • Figure 1: Part of the HPSG type hierarchy (simplified; adapted from ERG). NB: This is not a derivation.
  • Figure 2: Two interpretations of the sentence "The dog barks." The second one is an unlikely noun phrase fragment, which would be discarded with the supertagging technique. (Trees provided by the English Resource Grammar Delphin-viz online demo.)
  • Figure 3: Pareto Frontier (Speed and F-score)