Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence
Yinhong Liu, Yixuan Su, Ehsan Shareghi, Nigel Collier
TL;DR
The paper addresses the challenge of evaluating discourse coherence in long-form text, where traditional metrics miss underlying structure. It introduces Positional Discourse Divergence (PDD), a simple, model-free metric that partitions two articles into $N$ positional bins and compares their discourse-role distributions bin by bin via KL divergence: $D_{pos} = \frac{1}{N} \sum_{n=1}^{N} D_{KL}(p^n(r) + \epsilon \,\|\, q^n(r) + \epsilon)$, where $p^n(r)$ and $q^n(r)$ are the distributions over discourse roles $r$ in bin $n$ of the two articles and $\epsilon$ is a small smoothing constant. Across the News Discourse, LFQA, and Recipe1M+ datasets, PDD agrees more closely with human judgments and GPT-4 coherence evaluations than baselines such as exact match, BLEU, and ROUGE-L, and the experiments also show how the choice of binning affects the metric's sensitivity. The approach depends on a discourse classifier to label roles and on an appropriate bin count $N$, which may vary by domain, but it provides a practical tool for assessing discourse structure in long-form generation.
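To make the formula concrete, here is a minimal Python sketch of $D_{pos}$. It is an illustration under stated assumptions, not the authors' implementation: it assumes each article has already been converted into a sequence of sentence-level discourse-role labels by some classifier, that sentences are assigned to bins by relative position, that the natural logarithm is used, and that $\epsilon = 10^{-6}$. The function names `bin_distributions` and `pdd` are hypothetical. Following the formula as written, the smoothed distributions are not renormalized.

```python
import math
from collections import Counter


def bin_distributions(roles, n_bins, label_set):
    """Split a role sequence into n_bins positional bins and return,
    for each bin, the relative frequency of every label in label_set."""
    counts = [Counter() for _ in range(n_bins)]
    total = len(roles)
    for i, role in enumerate(roles):
        # Assumption: a sentence's bin is given by its relative position.
        bin_index = min(i * n_bins // total, n_bins - 1)
        counts[bin_index][role] += 1
    dists = []
    for c in counts:
        size = sum(c.values())
        # An empty bin yields an all-zero vector; eps smoothing handles it later.
        dists.append([c[r] / size if size else 0.0 for r in label_set])
    return dists


def pdd(roles_p, roles_q, n_bins=4, eps=1e-6):
    """Average per-bin KL divergence between two role sequences,
    with eps added to every probability as in the formula above."""
    label_set = sorted(set(roles_p) | set(roles_q))
    p_bins = bin_distributions(roles_p, n_bins, label_set)
    q_bins = bin_distributions(roles_q, n_bins, label_set)
    total = 0.0
    for p, q in zip(p_bins, q_bins):
        total += sum(
            (pi + eps) * math.log((pi + eps) / (qi + eps))
            for pi, qi in zip(p, q)
        )
    return total / n_bins


# Hypothetical role labels, for illustration only.
reference = ["main", "main", "context", "context"]
candidate = ["context", "context", "main", "main"]
print(pdd(reference, reference, n_bins=2))  # identical structure
print(pdd(reference, candidate, n_bins=2))  # reversed structure
```

Two articles with the same positional role layout give a divergence of zero, while swapping the order of the roles gives a large value, which is the behaviour a position-sensitive structure metric needs. Note that KL divergence is asymmetric, so `pdd(a, b)` and `pdd(b, a)` can differ in general.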
Abstract
Recent large language models (LLMs) have shown remarkable performance in aligning generated text with user intentions across various tasks. For long-form text generation, there has been growing interest in evaluating generation from a discourse coherence perspective. However, existing lexical or semantic metrics such as BLEU, ROUGE, and BERTScore cannot effectively capture discourse coherence. The development of discourse-specific automatic evaluation methods for assessing LLM output warrants greater focus and exploration. In this paper, we present a novel automatic metric designed to quantify the discourse divergence between two long-form articles. Extensive experiments on three datasets from representative domains demonstrate that our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
