From Partial to Strictly Incremental Constituent Parsing

Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

TL;DR

This work investigates fully incremental constituent parsing using strictly left-to-right encoder–decoder architectures, in which each word of the input prefix $w_1 \dots w_i$ is added to a partial tree with a small lookahead $k \in \{0, 1, 2\}$. It evaluates two incremental decoding paradigms (a tagging-based approach and a transition-based approach with a graph neural network) and compares incremental encoders, including multilingual LLMs (e.g., mGPT, BLOOM-560M) and a 4-layer LSTM baseline. The results show that encoder quality largely governs performance under strict incrementality: transition-based decoders outperform tagging, and modest delays substantially improve accuracy, though gaps to non-incremental baselines remain, especially for lower-resource languages and in the absence of bidirectional encoding. The findings highlight encoder-centered bottlenecks and suggest directions toward real-time speculative decoding and broader multilingual evaluation to move parsing closer to human-like incremental processing.
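
To make the lookahead concrete, the following is a minimal sketch of which words a strictly incremental parser may consult when it commits to a decision for $w_i$ under a delay $k$. It assumes the decision for $w_i$ sees $w_1 \dots w_{i+k}$, truncated at the end of the sentence; the function name `visible_prefixes` is hypothetical and not taken from the paper.

```python
from typing import Iterator, List, Tuple


def visible_prefixes(words: List[str], k: int = 0) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word, visible context) for each word under a delay of k words.

    Assumption: the decision for the i-th word may look at the prefix up to
    and including that word, plus at most k following words.
    """
    if k < 0:
        raise ValueError("delay k must be non-negative")
    n = len(words)
    for i, word in enumerate(words):
        # Slice end is exclusive: i + 1 covers the prefix, + k adds the lookahead.
        yield word, words[: min(i + 1 + k, n)]


if __name__ == "__main__":
    for word, context in visible_prefixes(["The", "dog", "barks"], k=1):
        print(word, context)
```

With $k = 0$ the context for each word is exactly its prefix, which corresponds to the strictest setting; larger $k$ trades latency for additional right context.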

Abstract

We study incremental constituent parsers to assess their capacity to output trees based on prefix representations alone. Guided by strictly left-to-right generative language models and tree-decoding modules, we build parsers that adhere to a strong definition of incrementality across languages. This builds upon work that asserted incrementality, but that mostly only enforced it on either the encoder or the decoder. Finally, we conduct an analysis against non-incremental and partially incremental models.

Paper Structure (22 sections, 3 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Absolute (orange) and relative (green) indexing from Gómez-Rodríguez and Vilares (2018). Note that unary chains are collapsed into an artificial constituent (first label). The final label indicates the end of the sentence.
  • Figure 2: Transitions defined by Yang and Deng (2020) for a partial tree $T_3$ when a new word $w_4$ is added. Nodes in $\mathcal{R}(T_3)$ are marked in blue.
  • Figure 3: Per-constituent F-score of absolute (orange), relative (green) and transition-based (purple) decoders with mGPT (bars) and XLM-RoBERTa (dots) encoders. Different textures are used for delay 0 (solid), 1 (dotted) and 2 (gridded).
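
The absolute and relative indexing shown in Figure 1 can be illustrated with a small sketch. It follows the general idea of the sequence-labeling encoding: each word receives the number of ancestors it shares with the next word, together with the label of their lowest common ancestor, and the relative variant stores the difference with respect to the previous word instead. The tree format (nested tuples), the `EOS` marker and the function names are assumptions made for this illustration, and the handling of collapsed unary chains is omitted.

```python
from typing import List, Optional, Tuple, Union

# A tree is either a word (str) or a tuple (label, child, child, ...).
Tree = Union[str, tuple]
Label = Tuple[str, Optional[int], str]


def leaves_with_ancestors(tree: Tree, node_id: tuple = (), ancestors: tuple = ()) -> list:
    """Return [(word, ancestors)], with ancestors ordered from the root down.

    Each ancestor is (node_id, label); node_id is the path of child indices
    from the root, so two nodes with the same label stay distinguishable.
    """
    if isinstance(tree, str):
        return [(tree, ancestors)]
    label, *children = tree
    here = ancestors + ((node_id, label),)
    out = []
    for j, child in enumerate(children):
        out.extend(leaves_with_ancestors(child, node_id + (j,), here))
    return out


def encode(tree: Tree, relative: bool = False) -> List[Label]:
    """Encode a tree without unary chains as one (word, level, label) per word."""
    leaves = leaves_with_ancestors(tree)
    labels: List[Label] = []
    prev = 0
    for (word, anc), (_, nxt) in zip(leaves, leaves[1:]):
        # Count ancestors shared with the next word (common prefix of the paths).
        common = 0
        for a, b in zip(anc, nxt):
            if a != b:
                break
            common += 1
        level = common - prev if relative else common
        labels.append((word, level, anc[common - 1][1]))
        prev = common
    # Hypothetical end-of-sentence marker for the final word.
    labels.append((leaves[-1][0], None, "EOS"))
    return labels


if __name__ == "__main__":
    tree = ("S", ("NP", "the", "dog"), ("VP", "chased", ("NP", "a", "cat")))
    print(encode(tree))                 # absolute levels: 2, 1, 2, 3
    print(encode(tree, relative=True))  # relative levels: 2, -1, 1, 1
```

Because each label depends on the next word, a tagger of this kind needs at least the following word before it can label $w_i$, which is one reason small delays matter for tagging-based decoders.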