A Novel Computational and Modeling Foundation for Automatic Coherence Assessment

Aviya Maimon, Reut Tsarfaty

TL;DR

The paper tackles the lack of formal, scalable coherence metrics in NLP by grounding coherence in Reinhart's three conditions: cohesion, consistency, and relevance. It operationalizes these conditions as five proxy tasks (SRO, DRR, NPE, NLI, ISR) and trains a multi-task learning (MTL) model to capture their shared signals, evaluated on real-world (GCDC) and generated (CoheSentia) datasets. The results show that joint training yields state-of-the-art performance on most coherence proxy tasks and improves coherence scoring and reasoning across both domains, with notable cross-domain transfer. This framework provides a solid, scalable foundation for automatic coherence assessment and has the potential to enhance generation and evaluation in large-language-model workflows.

Abstract

Coherence is an essential property of well-written texts that refers to the way textual units relate to one another. In the era of generative AI, coherence assessment is essential for many NLP tasks: summarization, generation, long-form question answering, and more. However, in NLP coherence is an ill-defined notion, lacking a formal definition or evaluation metrics that would allow for large-scale automatic and systematic coherence assessment. To bridge this gap, in this work we employ the formal linguistic definition of Reinhart (1980) of what makes a discourse coherent, consisting of three conditions (cohesion, consistency, and relevance), and formalize these conditions as respective computational tasks. We hypothesize that (i) a model trained on all of these tasks will learn the features required for coherence detection, and that (ii) a joint model for all tasks will exceed the performance of models trained on each task individually. On two benchmarks for coherence scoring rated by humans, one containing 500 automatically generated short stories and another containing 4k real-world texts, our experiments confirm that jointly training on the proposed tasks leads to better performance on each task compared with task-specific models, and to better performance on assessing coherence overall, compared with strong baselines. We conclude that the formal and computational setup of coherence as proposed here provides a solid foundation for advanced methods of large-scale automatic assessment of coherence.

Paper Structure

This paper contains 63 sections, 3 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of the encoder-only model. The input is a pair of sentences (most tasks) or, for the NPE task, a document and the token IDs of different NPs
  • Figure 2: Accuracy on the Coherence Scoring task for both GCDC and CoheSentia with different subsets of proxy coherence tasks. The labels are task IDs (1-SRO, 2-ISR, 3-DRR, 4-NPE, 5-NLI)
  • Figure 3: Results for the SRO task with different subsets of coherence tasks used for fine-tuning. The labels give the number of tasks and, in curly brackets, which tasks (1 - SRO, 2 - ISR, 3 - DRR, 4 - NPE, 5 - NLI)
  • Figure 4: Illustration of the token head, which contains several stages: (1) an embedding for each token in the text, (2) an embedding for each NP, computed separately for its role as the complement and as the anchor, (3) a representation for each NP pair, and finally (4) a classification layer
  • Figure 5: Distribution of the main prepositions in the NP Enrichment test set
  • ...and 5 more figures
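
The four stages named in the Figure 4 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the caption does not say how NP embeddings are pooled, how a pair is represented, or how many labels the classifier has. Mean pooling over NP token spans, concatenation of the anchor and complement vectors, skipping self-pairs, the random weights, and all names and sizes (`token_head`, `EMB_DIM`, `ROLE_DIM`, `NUM_LABELS`, the example spans) are hypothetical choices made here.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for a learned linear layer (nothing is trained here)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def token_head(token_embs, np_spans, w_anchor, w_complement, w_classify):
    """Return {(anchor_idx, complement_idx): logits} for every ordered NP pair."""
    # Stage 2: one embedding per NP for each role, via separate projections.
    # Mean pooling over the NP's tokens is an assumption of this sketch.
    pooled = [mean_pool(token_embs[start:end]) for start, end in np_spans]
    anchors = [matvec(w_anchor, p) for p in pooled]
    complements = [matvec(w_complement, p) for p in pooled]

    logits = {}
    for i, anchor in enumerate(anchors):
        for j, complement in enumerate(complements):
            if i == j:
                continue  # assumption: an NP is not paired with itself
            # Stage 3: pair representation (here: concatenation of the two vectors).
            pair = anchor + complement
            # Stage 4: classification layer producing one logit per label.
            logits[(i, j)] = matvec(w_classify, pair)
    return logits


rng = random.Random(0)
EMB_DIM, ROLE_DIM, NUM_LABELS = 4, 3, 5  # hypothetical sizes

# Stage 1: an embedding for each token (random stand-ins for encoder output).
token_embs = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(6)]
np_spans = [(0, 2), (3, 5), (5, 6)]  # hypothetical NP token spans, end-exclusive

w_anchor = make_matrix(ROLE_DIM, EMB_DIM, rng)
w_complement = make_matrix(ROLE_DIM, EMB_DIM, rng)
w_classify = make_matrix(NUM_LABELS, 2 * ROLE_DIM, rng)

logits = token_head(token_embs, np_spans, w_anchor, w_complement, w_classify)
print(len(logits), "ordered NP pairs,", NUM_LABELS, "logits each")
```

Using separate projections for the two roles means the same NP can receive different scores as an anchor than as a complement, which is why the pairs here are ordered.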