Improving Unsupervised Constituency Parsing via Maximizing Semantic Information

Junjie Chen, Xiangheng He, Yusuke Miyao, Danushka Bollegala

TL;DR

The paper introduces SemInfo, a semantics-aware objective for unsupervised constituency parsing, built on a bag-of-substrings representation of sentence meaning and a paraphrase-derived estimation framework using the Probability-Weighted Information metric. SemInfo is integrated into PCFG parsing via a TreeCRF-based mean-field training pipeline, enabling efficient optimization that combines SemInfo with the traditional log-likelihood term. Empirically, SemInfo-trained PCFGs exhibit stronger alignment with parsing accuracy than LL, delivering substantial improvements across four languages and achieving near state-of-the-art performance in three of them while using fewer parameters. The work demonstrates that embedding semantic signals into syntactic parsing objectives yields robust, scalable gains and highlights the potential for semantically informed unsupervised parsing approaches.

Abstract

Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective that trains parsers by maximizing SemInfo, the semantic information encoded in constituent structures. We introduce a bag-of-substrings model to represent the semantics and estimate the SemInfo value using the probability-weighted information metric. We apply the SemInfo maximization objective to training Probabilistic Context-Free Grammar (PCFG) parsers and develop a Tree Conditional Random Field (TreeCRF)-based model to facilitate the training. Experiments show that SemInfo correlates more strongly with parsing accuracy than LL, establishing SemInfo as a better unsupervised parsing objective. As a result, our algorithm significantly improves parsing accuracy by an average of 7.85 sentence-F1 scores across five PCFG variants and in four languages, achieving state-of-the-art level results in three of the four languages.
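
To make the bag-of-substrings idea concrete, the following is a minimal sketch, not the paper's estimator. It enumerates every contiguous token span of a sentence and computes the naive share of paraphrases that contain each span. The function names, the whitespace tokenization, and the example sentences are assumptions for illustration. The probability-weighted information metric that the paper uses to estimate SemInfo is not reproduced here.

```python
from collections import Counter


def substrings(tokens, max_len=None):
    """Yield every contiguous token span of a sentence as a tuple."""
    n = len(tokens)
    limit = n if max_len is None else min(max_len, n)
    for start in range(n):
        for end in range(start + 1, min(start + limit, n) + 1):
            yield tuple(tokens[start:end])


def bag_of_substrings(sentence):
    """Count each substring of a whitespace-tokenized sentence (assumed tokenization)."""
    return Counter(substrings(sentence.split()))


def naive_paraphrase_frequency(paraphrases):
    """Fraction of paraphrases that contain each substring at least once.

    This is only the naive frequency baseline; it is not the
    probability-weighted information metric named in the abstract.
    """
    doc_freq = Counter()
    for paraphrase in paraphrases:
        # A set counts each substring once per paraphrase.
        doc_freq.update(set(substrings(paraphrase.split())))
    total = len(paraphrases)
    return {span: count / total for span, count in doc_freq.items()}


if __name__ == "__main__":
    # Hypothetical paraphrases, invented for this example.
    freq = naive_paraphrase_frequency(["the cat sat", "the cat slept"])
    for span, value in sorted(freq.items()):
        print(" ".join(span), value)
```

Spans shared by all paraphrases, such as "the cat" in the example, receive a frequency of 1.0, while spans specific to one wording receive lower values.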

Paper Structure

This paper contains 34 sections, 23 equations, 13 figures, 7 tables, and 4 algorithms.

Figures (13)

  • Figure 1: An illustration of the progressive semantics build-up in accordance with the constituent structure. The tree structure in the top-right shows the simplified constituent structure for illustration purposes. Constituent substrings are highlighted in blue.
  • Figure 2: Parallel structure between the traditional bag-of-words representation of topics and the proposed bag-of-substrings representation of semantics.
  • Figure 3: An example in which naive substring frequency among paraphrases fails to estimate $P(s|Sem(x))$.
  • Figure 4: Pipeline of our SemInfo maximization training.
  • Figure 5: Spearman rank analysis of (SemInfo, LL, $\text{SF1}^i$) pairs obtained from eight independently trained NPCFG models. The values are measured on two sentences in the English dataset. Please refer to Figure \ref{fig:sent_level-corr-6figs} for more examples.
  • ...and 8 more figures

Theorems & Definitions (1)

  • proof