Evaluating Discourse Cohesion in Pre-trained Language Models
Jie He, Wanqiu Long, Deyi Xiong
TL;DR
The paper addresses the lack of systematic evaluation of discourse cohesion in pre-trained language models by introducing a multi-phenomenon test suite that covers lexical and grammatical cohesion across adjacent and non-adjacent sentences. It evaluates six models using masked-word prediction to measure coherence generation and context utilization, and analyzes attention patterns to interpret model behavior. Key findings show RoBERTa generally outperforms BERT and BART, certain cohesion types remain challenging, and context significantly influences target word generation. The work provides a standardized benchmark for discourse cohesion and offers insights to guide future improvements toward global cohesion in language models.
Abstract
Large pre-trained neural models have achieved remarkable success in natural language processing (NLP), inspiring a growing body of research that analyzes their abilities from different perspectives. In this paper, we propose a test suite to evaluate the cohesive ability of pre-trained language models. The test suite covers multiple cohesion phenomena between adjacent and non-adjacent sentences. We compare different pre-trained language models on these phenomena and analyze the experimental results, hoping that discourse cohesion will receive more attention in the future.
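As a rough illustration of the masked-word evaluation mentioned in the summary, the sketch below scores a predictor on masked target words with and without the preceding context. It is not the paper's implementation: the example format, the `masked_accuracy` function, and the `toy_predict` stand-in are all hypothetical, and a real run would replace `toy_predict` with a call to an actual masked language model.

```python
from typing import Callable, List, Tuple

# Hypothetical example format (an assumption, not the paper's data schema):
# (context sentences, target sentence containing [MASK], gold word)
Example = Tuple[List[str], str, str]

EXAMPLES: List[Example] = [
    (["I bought a dog yesterday."], "The [MASK] barked all night.", "dog"),
    (["She adopted a cat."], "The [MASK] purred.", "cat"),
]


def masked_accuracy(
    examples: List[Example],
    predict: Callable[[str], str],
    use_context: bool = True,
) -> float:
    """Fraction of examples where the predictor recovers the gold word."""
    if not examples:
        return 0.0
    hits = 0
    for context, masked_sentence, gold in examples:
        # With context, the preceding sentences are prepended to the target.
        if use_context:
            text = " ".join(context + [masked_sentence])
        else:
            text = masked_sentence
        if predict(text).lower() == gold.lower():
            hits += 1
    return hits / len(examples)


def toy_predict(text: str) -> str:
    """Stand-in for a real model: answers 'dog' only if 'dog' is in the input."""
    return "dog" if "dog" in text.lower() else "cat"


with_context = masked_accuracy(EXAMPLES, toy_predict, use_context=True)
without_context = masked_accuracy(EXAMPLES, toy_predict, use_context=False)
print(f"with context: {with_context:.2f}, without context: {without_context:.2f}")
```

The gap between the two scores is one simple way to express how much a predictor relies on earlier sentences; the metrics actually used in the paper may differ.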
