Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser
Elena Chistova
TL;DR
We address the challenge of end-to-end RST parsing across 18 treebanks in 11 languages with incompatible relation inventories. We introduce UniRST, a unified parser built on the DMRST backbone, and evaluate two training strategies—Multi-Head and Masked-Union—to respect inventory differences while enabling cross-treebank transfer; Unmasked-Union serves as a lower-bound baseline. Empirically, Masked-Union with treebank-specific segmentation heads yields the strongest performance, with UniRST surpassing 16 of 18 mono-treebank baselines and achieving robust cross-lingual parsing across diverse resources, including GENTLE out-of-domain data. The work demonstrates the feasibility and benefits of end-to-end multilingual discourse parsing that embraces annotation heterogeneity, while noting limitations related to inventory disparities and data quality across treebanks.
Abstract
We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns a separate relation classification layer to each inventory, and Masked-Union, which enables shared-parameter training through selective label masking. We first benchmark mono-treebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter-efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of single-model, multilingual, end-to-end discourse parsing across diverse resources.
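As a rough illustration of selective label masking, the sketch below computes a classification loss over the union of two relation inventories while excluding labels that the current treebank does not use. It is a minimal sketch, not the UniRST implementation: the inventories, treebank names, and function names are hypothetical, and the real parser operates on learned neural representations rather than hand-written logits.

```python
import math

# Hypothetical toy inventories; real treebanks define their own relation labels.
INVENTORIES = {
    "treebank_a": ["Elaboration", "Contrast", "Cause"],
    "treebank_b": ["Elaboration", "Condition"],
}

# Union label space scored by a single shared classification layer.
UNION = sorted({label for labels in INVENTORIES.values() for label in labels})


def masked_log_softmax(logits, treebank):
    """Log-softmax over UNION, restricted to the treebank's own inventory."""
    allowed = set(INVENTORIES[treebank])
    # Labels outside the inventory get -inf, so they receive zero probability.
    masked = [x if lab in allowed else -math.inf for x, lab in zip(logits, UNION)]
    top = max(masked)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(x - top) for x in masked))
    return [x - log_z for x in masked]


def masked_nll(logits, gold, treebank):
    """Negative log-likelihood of the gold label under the masked distribution."""
    return -masked_log_softmax(logits, treebank)[UNION.index(gold)]
```

With uniform logits, the loss for a gold label depends only on how many labels the treebank's inventory allows, which shows that out-of-inventory labels do not compete for probability mass.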
