Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures

Junjie Chen, Xiangheng He, Danushka Bollegala, Yusuke Miyao

TL;DR

This work tackles unsupervised constituency parsing by exploiting the frequency of word sequences within sets of sentences that share the same Predicate-Argument Structure (PAS-equivalent sentences). It introduces the span-overlap parser, which generates PAS-equivalent sentences with instruction-following LLMs, computes a span-overlap score that reflects each word sequence's frequency in that set, and efficiently decodes the maximum-score constituent tree. Empirically, the approach achieves state-of-the-art or near-state-of-the-art performance in eight of ten languages and reveals a multilingual pattern in which participant-denoting constituents have higher span-overlap scores than equal-length event-denoting constituents. Ablations show meaningful gains from aggregating multiple PAS-preserving transformations, though sample quality and language-specific transformations remain important limitations for future work.

Abstract

Unsupervised constituency parsing focuses on identifying word sequences that form a syntactic unit (i.e., constituents) in target sentences. Linguists identify constituents by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences, in which constituents appear more frequently than non-constituents (i.e., a constituent corresponds to a frequent word sequence within the sentence set). However, such frequency information is unavailable to previous parsing methods, which identify constituents by observing sentences with diverse PAS. In this study, we empirically show that constituents correspond to frequent word sequences in the PAS-equivalent sentence set. We propose a frequency-based parser, span-overlap, that (1) computes the span-overlap score as a word sequence's frequency in the PAS-equivalent sentence set and (2) identifies the constituent structure by finding the constituent tree with the maximum span-overlap score. The parser achieves state-of-the-art level parsing accuracy, outperforming existing unsupervised parsers in eight out of ten languages. Additionally, we discover a multilingual phenomenon: participant-denoting constituents tend to have higher span-overlap scores than equal-length event-denoting constituents, meaning that the former tend to appear more frequently in the PAS-equivalent sentence set than the latter. The phenomenon indicates a statistical difference between the two constituent types, laying the foundation for future labeled unsupervised parsing research.
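
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the span-overlap score is the fraction of PAS-equivalent sentences that contain the span as a contiguous word sequence, and it assumes the tree score is the sum of its span scores, maximized with a CKY-style dynamic program over binary trees. The function names (`contains`, `span_overlap`, `decode`) and the toy paraphrases are hypothetical and hand-written; the paper's exact score definition and decoder may differ.

```python
from functools import lru_cache


def contains(sentence, words):
    """True if `words` occurs as a contiguous sub-list of `sentence`."""
    m = len(words)
    return any(sentence[i:i + m] == words for i in range(len(sentence) - m + 1))


def span_overlap(target, samples, i, j):
    """Assumed score: fraction of samples that contain target[i:j] verbatim."""
    if not samples:
        return 0.0
    words = target[i:j]
    return sum(contains(s, words) for s in samples) / len(samples)


def decode(target, samples):
    """Return (total score, spans) of the best binary tree over `target`.

    Assumption: a tree's score is the sum of the scores of all its spans.
    Spans are half-open (i, j) word-index pairs.
    """
    n = len(target)
    score = {(i, j): span_overlap(target, samples, i, j)
             for i in range(n) for j in range(i + 1, n + 1)}

    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:  # a single word is always a leaf span
            return score[(i, j)], ((i, j),)
        candidates = []
        for k in range(i + 1, j):  # try every split point
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            candidates.append((left_score + right_score, left_spans + right_spans))
        sub_score, sub_spans = max(candidates, key=lambda c: c[0])
        return score[(i, j)] + sub_score, ((i, j),) + sub_spans

    return best(0, n)


# Hand-written toy paraphrases standing in for LLM-generated samples.
target = "the cat chased the mouse".split()
samples = [s.split() for s in (
    "the mouse was chased by the cat",
    "it is the cat that chased the mouse",
    "did the cat chase the mouse",
)]
total, spans = decode(target, samples)
print(sorted(s for s in spans if s[1] - s[0] > 1))
```

In this toy example the word sequences "the cat" and "the mouse" survive every paraphrase, while "the cat chased" appears in none of them, so the decoder prefers a tree that groups "chased the mouse" over one that groups "the cat chased".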

Paper Structure (16 sections, 3 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: A target sentence and a set of PAS-equivalent sentences. We highlight in bold constituents that are frequent in the PAS-equivalent sentence set.
  • Figure 2: Overview of the span-overlap method. The grey box indicates the set of PAS-equivalent sentences.
  • Figure 3: Mean span-overlap scores and their distributions for S, VP, NP, PP, QP, and Random in the PTB development set. The distributions are approximated with histograms.
  • Figure 4: Mean and skewness values of span-overlap scores.
  • Figure 5: Proportion of gold constituents to which the scores assign strictly higher values than to non-constituents.
  • ...and 2 more figures