Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures
Junjie Chen, Xiangheng He, Danushka Bollegala, Yusuke Miyao
TL;DR
This work tackles unsupervised constituency parsing by exploiting how frequently word sequences occur within sets of sentences that share the same predicate-argument structure (PAS). It introduces the span-overlap parser, which generates PAS-equivalent sentences with instruction-following LLMs, computes a span-overlap score that reflects word-sequence frequency, and efficiently decodes the constituent tree with the maximum score. Empirically, the approach reaches state-of-the-art-level accuracy, outperforming existing unsupervised parsers in eight of ten languages, and reveals a multilingual pattern in which participant-denoting constituents have higher span-overlap scores than equal-length event-denoting constituents. Ablation shows meaningful gains from aggregating multiple PAS-preserving transformations, though sample quality and language-specific transformations remain limitations to address in future work.
Abstract
Unsupervised constituency parsing focuses on identifying word sequences that form syntactic units (i.e., constituents) in target sentences. Linguists identify a constituent by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences, in which the constituent appears more frequently than non-constituents (i.e., the constituent corresponds to a frequent word sequence within the sentence set). However, such frequency information is unavailable to previous parsing methods, which identify constituents by observing sentences with diverse PAS. In this study, we empirically show that constituents correspond to frequent word sequences in the PAS-equivalent sentence set. We propose a frequency-based parser, span-overlap, that (1) computes the span-overlap score as a word sequence's frequency in the PAS-equivalent sentence set and (2) identifies the constituent structure by finding the constituent tree with the maximum span-overlap score. The parser achieves state-of-the-art-level parsing accuracy, outperforming existing unsupervised parsers in eight out of ten languages. Additionally, we discover a multilingual phenomenon: participant-denoting constituents tend to have higher span-overlap scores than equal-length event-denoting constituents, meaning that the former tend to appear more frequently in the PAS-equivalent sentence set than the latter. This phenomenon indicates a statistical difference between the two constituent types, laying the foundation for future research on labeled unsupervised parsing.
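The two steps of the parser can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the span-overlap score is taken to be the fraction of PAS-equivalent sentences that contain the span's word sequence as an exact contiguous match, the tree is assumed to be binary, and decoding uses a CKY-style dynamic program that maximizes the sum of span scores. The function names (`contains`, `span_overlap_scores`, `decode`) and the hand-written sample sentences are hypothetical; in the paper the samples are generated by an instruction-following LLM.

```python
from functools import lru_cache


def contains(tokens, seq):
    """Return True if seq occurs as a contiguous run inside tokens."""
    n, m = len(tokens), len(seq)
    return any(tokens[i:i + m] == seq for i in range(n - m + 1))


def span_overlap_scores(target, samples):
    """Score every span (i, j), j exclusive, of the target sentence.

    Assumption: the score is the fraction of samples that contain the
    span's word sequence verbatim.
    """
    scores = {}
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n + 1):
            hits = sum(contains(s, target[i:j]) for s in samples)
            scores[(i, j)] = hits / len(samples) if samples else 0.0
    return scores


def decode(target, scores):
    """Return (total score, spans) of a best binary tree (CKY-style)."""
    n = len(target)

    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:
            # A single word is always a leaf span.
            return scores[(i, j)], frozenset({(i, j)})
        candidates = []
        for k in range(i + 1, j):
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            candidates.append((left_score + right_score, left_spans | right_spans))
        # On ties, max keeps the first (leftmost) split point.
        sub_score, sub_spans = max(candidates, key=lambda c: c[0])
        return sub_score + scores[(i, j)], sub_spans | {(i, j)}

    return best(0, n)


# Hypothetical, hand-written PAS-equivalent samples for illustration only.
target = "the cat chased a mouse".split()
samples = [
    "a mouse was chased by the cat".split(),
    "it was the cat that chased a mouse".split(),
    "the cat has chased a mouse".split(),
]
scores = span_overlap_scores(target, samples)
total, spans = decode(target, scores)
print(sorted(spans))
```

In this toy input, "the cat" and "a mouse" occur in every sample, while "cat chased" occurs in none, so the decoded tree keeps the former as constituents and excludes the latter. Exact string matching is the simplest choice here; a real system would likely need to handle inflection and tokenization differences between the target sentence and the samples.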
