Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence

Yinhong Liu, Yixuan Su, Ehsan Shareghi, Nigel Collier

TL;DR

The paper addresses the challenge of evaluating discourse coherence in long-form text, where traditional metrics miss underlying structure. It introduces Positional Discourse Divergence (PDD), a simple, model-free metric that partitions each article into $N$ positional bins and compares the per-bin discourse-role distributions of two articles via KL divergence: $D_{pos} = \frac{1}{N} \sum_{n=1}^{N} D_{KL}(p^n(r) + \epsilon \,\|\, q^n(r) + \epsilon)$, where $p^n(r)$ and $q^n(r)$ are the distributions over discourse roles $r$ in bin $n$ of the two articles and $\epsilon$ is a small smoothing constant. Across the News Discourse, LFQA, and Recipe1M+ datasets, PDD shows higher agreement with human judgments and GPT-4 coherence evaluations than baselines such as exact match, BLEU, and ROUGE-L, and the experiments also show how the choice of bin count affects the metric's sensitivity. The approach depends on a discourse classifier and on an appropriate bin count $N$, which may vary by domain, but it provides a practical tool for assessing discourse structure in long-form generation.
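The formula above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each article has already been labelled with one discourse role per sentence (in the paper this comes from a discourse classifier), that sentence `i` of `T` is assigned to bin `floor(i * N / T)`, and that `eps` is added without renormalising, as the formula is written. The function names, the default `eps`, and the role labels in the usage example are all hypothetical.

```python
import math
from collections import Counter


def positional_distributions(roles, label_set, n_bins):
    """Return one discourse-role distribution per positional bin.

    Assumption: sentence i of T goes to bin floor(i * n_bins / T).
    """
    total = len(roles)
    if total < n_bins:
        # With fewer sentences than bins, some bin would be empty.
        raise ValueError("need at least n_bins labelled sentences")
    counts = [Counter() for _ in range(n_bins)]
    for i, role in enumerate(roles):
        counts[i * n_bins // total][role] += 1
    dists = []
    for c in counts:
        size = sum(c.values())
        dists.append([c[r] / size for r in label_set])
    return dists


def smoothed_kl(p, q, eps):
    """KL divergence with eps added to every entry of both distributions."""
    return sum(
        (pi + eps) * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
    )


def pdd(roles_p, roles_q, label_set, n_bins=5, eps=1e-6):
    """Average the per-bin smoothed KL divergence over all bins."""
    p_bins = positional_distributions(roles_p, label_set, n_bins)
    q_bins = positional_distributions(roles_q, label_set, n_bins)
    return sum(smoothed_kl(p, q, eps) for p, q in zip(p_bins, q_bins)) / n_bins


# Hypothetical role labels, for illustration only.
labels = ["main_event", "context", "quote"]
article_a = ["main_event", "main_event", "context", "context", "quote", "quote"]
article_b = ["quote", "quote", "context", "context", "main_event", "main_event"]

same = pdd(article_a, article_a, labels, n_bins=3)
different = pdd(article_a, article_b, labels, n_bins=3)
```

Two articles with identical positional role sequences score zero, while reordering the same roles yields a positive score even though the overall role counts are unchanged. That position sensitivity is what the binning is meant to provide.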

Abstract

Recent large language models (LLMs) have shown remarkable performance in aligning generated text with user intentions across various tasks. When it comes to long-form text generation, there has been growing interest in generation from a discourse coherence perspective. However, existing lexical or semantic metrics such as BLEU, ROUGE, and BERTScore cannot effectively capture discourse coherence. The development of discourse-specific automatic evaluation methods for assessing the output of LLMs warrants greater focus and exploration. In this paper, we present a novel automatic metric designed to quantify the discourse divergence between two long-form articles. Extensive experiments on three datasets from representative domains demonstrate that our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.

Paper Structure (22 sections, 1 equation, 3 figures, 1 table)

Figures (3)

  • Figure 1: A news article example with discourse role annotations. The discourse schema follows the news discourse theory of van Dijk (2013).
  • Figure 2: Positional discourse distribution comparisons (N=5). Top row: The discourse distribution of model predictions on News Discourse test set (Llama2-7b, finetuned on Kaggle All the News). Bottom row: Test set reference distributions.
  • Figure 3: Positional Discourse Divergence vs. bin number ($N$) for predictions by two language models on the News Discourse test set. Training details are given in the paper's appendix. The curves are best-fit quadratics.