ESURF: Simple and Effective EDU Segmentation

Mohammadreza Sediqin, Shlomo Engelson Argamon

TL;DR

This work tackles EDU segmentation as a bottleneck in RST discourse parsing by introducing ESURF, a simple yet effective boundary detector that relies on lexical cues and character n-grams within a 9-token context, learned via random forests. ESURF achieves state-of-the-art segmentation performance on CNN/Daily Mail and RST-DT and yields measurable improvements in a transition-based RST parser, underscoring the value of lightweight, training-efficient features for discourse analysis. The findings highlight the sufficiency of local lexical/morphological information for identifying basic discourse units and point to scalable avenues (e.g., semi-supervised labeling on the GUM corpus) for further improving EDU segmentation. Overall, the method demonstrates that simple features can rival heavier neural approaches in EDU segmentation and improve downstream parsing outcomes, with practical implications for efficiency and resource-constrained settings.

Abstract

Segmenting text into Elemental Discourse Units (EDUs) is a fundamental task in discourse parsing. We present a new, simple method for identifying EDU boundaries, and hence segmenting text into EDUs, based on lexical and character n-gram features, using random forest classification. We show that the method, despite its simplicity, outperforms other methods both for segmentation and within a state-of-the-art discourse parser. This indicates the importance of such features for identifying basic discourse elements, pointing towards potentially more training-efficient methods for discourse analysis.
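
To make the feature idea concrete, here is a minimal sketch of how lexical and character n-gram features might be collected from a 9-token context around a candidate boundary. It is an illustration under stated assumptions, not the authors' implementation: the symmetric window layout (four tokens on each side of the candidate), the n-gram range of 1 to 3, the `<pad>` symbol, and the function names `char_ngrams` and `window_features` are all hypothetical choices.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding symbol for positions outside the text


def char_ngrams(token: str, n_min: int = 1, n_max: int = 3) -> List[str]:
    """Return all character n-grams of `token` for n in [n_min, n_max]."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def window_features(tokens: List[str], index: int, half_width: int = 4) -> Dict[str, int]:
    """Build sparse binary features for the candidate boundary at tokens[index].

    The window spans 2 * half_width + 1 tokens (9 by default), centered on the
    candidate. This centering is an assumption made for illustration.
    """
    features: Dict[str, int] = {}
    for offset in range(-half_width, half_width + 1):
        pos = index + offset
        token = tokens[pos].lower() if 0 <= pos < len(tokens) else PAD
        # Lexical feature: the (lowercased) token at this relative position.
        features[f"tok[{offset}]={token}"] = 1
        # Character n-gram features, skipped for padding positions.
        if token != PAD:
            for gram in char_ngrams(token):
                features[f"chr[{offset}]={gram}"] = 1
    return features


if __name__ == "__main__":
    sentence = "He left because it rained .".split()
    feats = window_features(sentence, 2)  # candidate boundary at "because"
    print(sorted(k for k in feats if k.startswith("tok[")))
```

Feature dictionaries of this kind could then be vectorized (for example with scikit-learn's `DictVectorizer`) and passed to a random forest classifier such as `RandomForestClassifier`, with one binary label per token indicating whether an EDU boundary occurs there.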

Paper Structure (9 sections, 1 figure, 3 tables)