A Novel Computational and Modeling Foundation for Automatic Coherence Assessment
Aviya Maimon, Reut Tsarfaty
TL;DR
The paper tackles the lack of formal, scalable coherence metrics in NLP by grounding coherence in Reinhart's three conditions: cohesion, consistency, and relevance. It operationalizes these conditions as five proxy tasks: sentence reordering (SRO), discourse-relation recognition (DRR), NP enrichment (NPE), natural language inference (NLI), and irrelevant-sentence recognition (ISR). A multi-task learning (MTL) model is trained to capture the signals these tasks share, and is evaluated on real-world (GCDC) and generated (CoheSentia) datasets. The results show that joint training yields state-of-the-art performance on most coherence proxy tasks and improves coherence scoring and reasoning across both domains, with notable cross-domain transfer. The framework provides a solid, scalable foundation for automatic coherence assessment and could improve generation and evaluation in large-language-model workflows.
Abstract
Coherence is an essential property of well-written texts that refers to the way textual units relate to one another. In the era of generative AI, coherence assessment is essential for many NLP tasks: summarization, generation, long-form question answering, and more. However, in NLP coherence is an ill-defined notion, lacking both a formal definition and evaluation metrics that would allow for large-scale, automatic, and systematic coherence assessment. To bridge this gap, in this work we employ the formal linguistic definition of Reinhart (1980) of what makes a discourse coherent, consisting of three conditions (cohesion, consistency, and relevance), and formalize these conditions as respective computational tasks. We hypothesize that (i) a model trained on all of these tasks will learn the features required for coherence detection, and that (ii) a joint model for all tasks will exceed the performance of models trained on each task individually. On two benchmarks for coherence scoring rated by humans, one containing 500 automatically generated short stories and another containing 4k real-world texts, our experiments confirm that jointly training on the proposed tasks leads to better performance on each task compared with task-specific models, and to better performance on assessing coherence overall, compared with strong baselines. We conclude that the formal and computational setup of coherence as proposed here provides a solid foundation for advanced methods of large-scale automatic assessment of coherence.
