Revisiting Absence with Symptoms that *T* Show up Decades Later to Recover Empty Categories

Emily Chen, Nicholas Huang, Casey Robinson, Kevin Xu, Zihao Huang, Jungyeul Park

TL;DR

This work tackles the restoration of null elements in parse trees for English, Chinese, and Korean, combining a rule-based method (extended to the CTB) with a language-agnostic neural seq2seq approach. Rule-based restoration on the CTB reaches an overall F1 of 80.00, while the neural models reach F1 scores of up to 90.94 on English, 85.38 on Chinese, and 88.79 on Korean. The study also analyzes the trade-offs between post-processing and rule-based methods and discusses how semantic cues could further improve restoration, especially for the pro/PRO distinction in Chinese. Overall, neural restoration shows cross-linguistic promise as a practical way to recover implicit syntactic and semantic information in languages with varied null-element inventories, which may benefit parsing and downstream tasks such as machine translation.

Abstract

This paper explores null elements in the English, Chinese, and Korean Penn treebanks. Null elements contain important syntactic and semantic information, yet they have typically been treated as entities to be removed during language processing tasks, particularly in constituency parsing. Thus, we work towards the removal and, in particular, the restoration of null elements in parse trees. We focus on extending a rule-based approach that uses linguistic context information to Chinese, as rule-based approaches have historically been applied only to English. We also conduct neural experiments with a language-agnostic sequence-to-sequence model to recover null elements for English (PTB), Chinese (CTB), and Korean (KTB). To the best of the authors' knowledge, this is the first time null elements in three different languages have been explored and compared. In extending the rule-based approach to Chinese, we achieve an overall F1 score of 80.00, which is comparable to past results on the CTB. In our neural experiments, we achieve F1 scores of up to 90.94, 85.38, and 88.79 for English, Chinese, and Korean, respectively, with functional labels.
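
To make the "removal" step concrete, the sketch below strips null elements from a bracketed Penn-style tree. It is a minimal illustration, not the authors' implementation. It assumes that null elements are preterminals tagged -NONE- (the Penn Treebank convention) and that a constituent left with no children after stripping is dropped as well. The function names parse, strip_nulls, and to_string are hypothetical, and the example tree is hand-written rather than taken from any treebank.

```python
def parse(s):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves (word tokens or trace markers such as *T*-1) stay plain strings.
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def strip_nulls(tree):
    """Return a copy of the tree without -NONE- nodes (assumed null-element tag).

    Constituents that become empty are removed too; returns None if nothing is left.
    Co-indexation on overt nodes (e.g. WHNP-1) is left untouched in this sketch.
    """
    label, children = tree
    if label == "-NONE-":
        return None
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)
        else:
            stripped = strip_nulls(child)
            if stripped is not None:
                kept.append(stripped)
    return (label, kept) if kept else None


def to_string(tree):
    """Serialize a (label, children) tree back to bracketed notation."""
    label, children = tree
    parts = [c if isinstance(c, str) else to_string(c) for c in children]
    return "(" + label + " " + " ".join(parts) + ")"


if __name__ == "__main__":
    # Hand-written example: a relative clause with a subject trace.
    with_trace = ("(NP (NP (NNS symptoms)) (SBAR (WHNP-1 (WDT that)) "
                  "(S (NP-SBJ (-NONE- *T*-1)) (VP (VBP show) (PRT (RP up))))))")
    print(to_string(strip_nulls(parse(with_trace))))
```

Restoration is the inverse problem: given the stripped tree, predict where the -NONE- nodes were and what they contained, which is what the rule-based and seq2seq methods described above aim to do.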

Paper Structure

This paper contains 22 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: English (PTB), with and without traces
  • Figure 2: Chinese (CTB), with and without traces
  • Figure 3: Example of the linearization dataset