An Attempt to Develop a Neural Parser based on Simplified Head-Driven Phrase Structure Grammar on Vietnamese

Duc-Vu Nguyen, Thang Chau Phan, Quoc-Nam Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

TL;DR

The paper tackles Vietnamese parsing by building a neural parser grounded in a simplified HPSG framework and addressing corpus nonconformities in VietTreebank and VnDT. It integrates PhoBERT and XLM-RoBERTa encodings within a Joint Span HPSG architecture, using permutation-based corrections to align training data with the simplified rules. On VTB and VnDT, the approach achieves a constituency F-score of $82.34\%$ and competitive UAS for dependency parsing, though LAS is lower due to preserving original labels without linguistic expert input. VLSP 2023 experiments show competitive performance, with the HPSG model reaching a private-test F-score of $89.04\%$, highlighting the value of linguistic guidance in treebank annotation for Vietnamese NLP.

Abstract

In this paper, we aimed to develop a neural parser for Vietnamese based on simplified Head-Driven Phrase Structure Grammar (HPSG). In the existing corpora, VietTreebank and VnDT, around 15% of constituency and dependency tree pairs did not adhere to simplified HPSG rules. To address this, we randomly permuted samples from the training and development sets to make them compliant with simplified HPSG. We then modified the first simplified HPSG Neural Parser for the Penn Treebank by replacing its encoder with the PhoBERT or XLM-RoBERTa models, which can encode Vietnamese texts. We conducted experiments on our modified VietTreebank and VnDT corpora. Our extensive experiments showed that the simplified HPSG Neural Parser achieved a new state-of-the-art F-score of 82% for constituency parsing when using the same predicted part-of-speech (POS) tags as the self-attentive constituency parser. Additionally, it outperformed previous studies in dependency parsing with a higher Unlabeled Attachment Score (UAS). However, our parser obtained a lower Labeled Attachment Score (LAS), likely because we focused on arc permutation without changing the original labels, as we did not consult a linguistic expert. Lastly, our findings suggest that linguistic experts should give simplified HPSG more attention when developing treebanks for Vietnamese natural language processing.
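The difference between UAS and LAS mentioned above can be made concrete with a minimal sketch. It assumes each word is represented by a `(head, label)` pair with head index 0 for the root. The function name `attachment_scores` and the labels in the example are placeholders chosen for illustration, not taken from the paper or from VnDT.

```python
from typing import List, Tuple

# One arc per word: (index of the head word, dependency label).
# Head index 0 denotes the artificial root (an assumption of this sketch).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] for one or more sentences.

    UAS counts words whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return head_hits / total, labeled_hits / total


# Illustrative arcs only; the labels are placeholders.
gold = [(2, "sub"), (0, "root"), (2, "obj"), (3, "mod")]
pred = [(2, "sub"), (0, "root"), (2, "mod"), (2, "mod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case the third word has the correct head but a wrong label, so it counts toward UAS and not LAS. This is the kind of gap that can open up when arcs are changed while the original labels are kept.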

Paper Structure

This paper contains 20 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Constituent, dependency, and joint span structures, extracted from the training datasets of VTB and VnDT and intended solely for visualization purposes, may contain slight labeling errors. These structures represent the same Vietnamese sentence, indexed from 1 to 7 and assigned an interval range for each node. The sentence "Gần bùn mà thấy trời xanh" translates to "Close to the mud but seeing the blue sky." The Adjective Phrase (AP) "Gần bùn" corresponds to "close to the mud," and the Verb Phrase (VP) "thấy trời xanh" corresponds to "seeing the blue sky." Dependency arcs indicate grammatical relationships such as subject, object, and modifiers. The joint span structure combines constituent and dependency structures, explicitly marking the category (Categ) and head word (HEAD) for each span.
  • Figure 2: Distributions in the VLSP 2023 Vietnamese Treebank training set.
  • Figure 3: Balancing Constituency and Dependency in Joint Span HPSG Parsing on the VTB & VnDT Development Sets.
  • Figure 4: A tentative set of head rules for the VLSP 2023 Vietnamese Treebank, developed from an engineering rather than a linguistic background.
  • Figure 5: Our tagging and parsing results on the public set of the VLSP 2023 Vietnamese Treebank.
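
The joint span structure described in the Figure 1 caption pairs each constituent span with a category and a head word. A minimal sketch of how such a structure might be derived, and how a non-compliant tree pair might be detected, is given below. It rests on an assumption that this section does not state: a span is treated as compliant when exactly one of its words has its dependency parent outside the span, and that word is taken as the span's head. The names `span_head` and `joint_spans` and the toy inputs are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # inclusive 1-based word positions (start, end)


def span_head(span: Span, heads: List[int]) -> Optional[int]:
    """Return the unique word in `span` whose parent lies outside the span.

    heads[i - 1] is the parent of word i, with 0 standing for the root.
    Returns None when there is not exactly one such word (assumed here to
    mean the span is not compliant).
    """
    start, end = span
    external = [
        word
        for word in range(start, end + 1)
        if not (start <= heads[word - 1] <= end)
    ]
    return external[0] if len(external) == 1 else None


def joint_spans(
    spans: Dict[Span, str], heads: List[int]
) -> Optional[Dict[Span, Tuple[str, int]]]:
    """Attach a head word to every labeled span.

    Returns a mapping span -> (category, head), or None if any span lacks
    a unique head under the assumption above.
    """
    result: Dict[Span, Tuple[str, int]] = {}
    for span, category in spans.items():
        head = span_head(span, heads)
        if head is None:
            return None
        result[span] = (category, head)
    return result


# Toy four-word input; categories and arcs are illustrative only.
spans = {(1, 4): "S", (3, 4): "NP"}
print(joint_spans(spans, [2, 0, 4, 2]))  # compatible pair
print(joint_spans(spans, [2, 0, 2, 2]))  # words 3 and 4 both attach outside (3, 4)
```

Under this assumption, the second call returns `None` because the span `(3, 4)` has two words attached outside it. A tree pair like that would need its arcs adjusted before a joint span structure could be built.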