Improving Unsupervised Constituency Parsing via Maximizing Semantic Information
Junjie Chen, Xiangheng He, Yusuke Miyao, Danushka Bollegala
TL;DR
The paper introduces SemInfo, a semantics-aware objective for unsupervised constituency parsing, built on a bag-of-substrings representation of sentence meaning and a paraphrase-derived estimation framework using the Probability-Weighted Information metric. SemInfo is integrated into PCFG parsing via a TreeCRF-based mean-field training pipeline, enabling efficient optimization that combines SemInfo with the traditional log-likelihood (LL) term. Empirically, SemInfo values correlate more strongly with parsing accuracy than LL values do, and training with SemInfo improves sentence-F1 by an average of 7.85 points across five PCFG variants and four languages, reaching near state-of-the-art performance in three of those languages while using fewer parameters. The work shows that adding a semantic signal to the parsing objective yields consistent gains and points to the potential of semantically informed unsupervised parsing.
Abstract
Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective that trains parsers by maximizing SemInfo, the semantic information encoded in constituent structures. We represent the semantics with a bag-of-substrings model and estimate the SemInfo value using the probability-weighted information metric. We apply the SemInfo maximization objective to training Probabilistic Context-Free Grammar (PCFG) parsers and develop a Tree Conditional Random Field (TreeCRF)-based model to facilitate the training. Experiments show that SemInfo correlates more strongly with parsing accuracy than LL, establishing SemInfo as a better unsupervised parsing objective. As a result, our algorithm improves parsing accuracy by an average of 7.85 sentence-F1 points across five PCFG variants and four languages, achieving state-of-the-art level results in three of the four languages.
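For readers unfamiliar with the sentence-F1 figure quoted above, the following is a minimal sketch of how such a score is commonly computed: an F1 over constituent spans is taken per sentence and then averaged over sentences. The function names, the (start, end) span representation, and the handling of empty span sets are assumptions made for illustration; the paper's exact evaluation conventions (for example, which trivial spans or punctuation are excluded) are not stated in this section.

```python
def span_f1(pred_spans, gold_spans):
    """F1 between predicted and gold constituent spans of one sentence.

    Spans are (start, end) token offsets. Hypothetical representation,
    chosen for illustration only.
    """
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred and not gold:
        # Assumed convention: nothing to predict and nothing predicted.
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def sentence_f1(pred_trees, gold_trees):
    """Mean of per-sentence span F1, reported on a 0-100 scale."""
    scores = [span_f1(p, g) for p, g in zip(pred_trees, gold_trees)]
    return 100.0 * sum(scores) / len(scores)


# Two toy sentences: one half-correct parse and one fully correct parse.
pred = [[(0, 2), (2, 5)], [(0, 3)]]
gold = [[(0, 2), (3, 5)], [(0, 3)]]
score = sentence_f1(pred, gold)
```

Averaging per sentence, rather than pooling spans over the whole corpus, gives each sentence equal weight regardless of its length.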
