Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

Elena Chistova

TL;DR

We address the challenge of end-to-end RST parsing across 18 treebanks in 11 languages with incompatible relation inventories. We introduce UniRST, a unified parser built on the DMRST backbone, and evaluate two training strategies—Multi-Head and Masked-Union—to respect inventory differences while enabling cross-treebank transfer; Unmasked-Union serves as a lower-bound baseline. Empirically, Masked-Union with treebank-specific segmentation heads yields the strongest performance, with UniRST surpassing 16 of 18 mono-treebank baselines and achieving robust cross-lingual parsing across diverse resources, including GENTLE out-of-domain data. The work demonstrates the feasibility and benefits of end-to-end multilingual discourse parsing that embraces annotation heterogeneity, while noting limitations related to inventory disparities and data quality across treebanks.

Abstract

We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns a separate relation classification layer to each inventory, and Masked-Union, which enables shared-parameter training through selective label masking. We first benchmark mono-treebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter-efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of single-model, multilingual, end-to-end discourse parsing across diverse resources.
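The selective label masking behind Masked-Union can be illustrated with a small sketch. It is not the UniRST implementation: the treebank names, relation labels, and helper functions below are hypothetical, and it assumes the mask is applied to the classifier's logits before the softmax, so that each example is scored only against the labels of its own treebank's inventory.

```python
import math

# Hypothetical toy inventories; real treebanks define their own label sets.
INVENTORIES = {
    "treebank_a": ["Elaboration", "Contrast", "Cause"],
    "treebank_b": ["Elaboration", "Condition"],
}

# One shared output space: the union of all relation labels.
UNION = sorted({label for labels in INVENTORIES.values() for label in labels})


def label_mask(treebank):
    """Return True for union labels that exist in the treebank's inventory."""
    allowed = set(INVENTORIES[treebank])
    return [label in allowed for label in UNION]


def masked_log_softmax(logits, mask):
    """Log-softmax over the allowed labels only; masked labels get -inf."""
    kept = [z for z, keep in zip(logits, mask) if keep]
    top = max(kept)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in kept))
    return [z - log_norm if keep else float("-inf")
            for z, keep in zip(logits, mask)]


def masked_nll(logits, gold, treebank):
    """Negative log-likelihood of the gold label under the treebank's mask."""
    mask = label_mask(treebank)
    gold_index = UNION.index(gold)
    if not mask[gold_index]:
        raise ValueError(f"{gold!r} is not in the inventory of {treebank!r}")
    return -masked_log_softmax(logits, mask)[gold_index]
```

In this sketch a single set of logits serves every treebank, and only the mask changes per example. A Multi-Head variant would instead keep one output layer per inventory and select it by treebank.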

Paper Structure

This paper contains 18 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Model variants in the UniRST framework. (a) Multi-Head: independent classifiers per relation inventory. (b) Masked-Union: shared classifier with treebank-specific label masking.
  • Figure 2: Relation class frequency across treebanks.
  • Figure 3: Relation class frequency (continuation).