CoUDA: Coherence Evaluation via Unified Data Augmentation
Dawei Zhu, Wenhao Wu, Yifan Song, Fangwei Zhu, Ziqiang Cao, Sujian Li
TL;DR
CoUDA tackles coherence evaluation by addressing both global discourse organization and local sentence transitions under data scarcity. It introduces a unified data augmentation framework consisting of global shuffling and a novel generative local augmentor with context truncation and coherence filtering, plus a unified scoring mechanism that combines global and local cues. With a compact model of 233M parameters, CoUDA achieves state-of-the-art correlations on SummEval and superior pairwise ranking on INSteD, often outperforming GPT-4-based metrics in the pointwise setting. The approach offers a practical, linguistically informed, and efficient solution for robust discourse coherence assessment in summarization and related tasks.
Abstract
Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used for training coherence evaluation models. However, previous augmentations for this task primarily rely on heuristic rules and lack design criteria to guide them. In this paper, we take inspiration from the linguistic theory of discourse structure and propose a data augmentation framework named CoUDA. CoUDA breaks down discourse coherence into global and local aspects and designs an augmentation strategy for each. For local coherence in particular, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two mechanisms to control the difficulty of the generated samples. During inference, CoUDA jointly evaluates the global and local aspects to comprehensively assess the overall coherence of a discourse. Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5- and GPT-4-based metrics.
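As a rough illustration of the global augmentation idea (building incoherent training samples by shuffling sentence order), the following is a minimal sketch and not the authors' implementation. The function names `shuffle_negative` and `make_global_pairs`, the 1/0 labels, and the assumption that the document is already split into sentences are all hypothetical choices made for this example.

```python
import random
from typing import List, Tuple


def shuffle_negative(sentences: List[str], rng: random.Random,
                     max_tries: int = 10) -> List[str]:
    """Return a reordering of `sentences` that differs from the original order.

    Hypothetical helper: the shuffled copy stands in for a globally
    incoherent version of the document.
    """
    if len(set(sentences)) < 2:
        # With fewer than two distinct sentences, every ordering looks the same.
        raise ValueError("need at least two distinct sentences to shuffle")
    for _ in range(max_tries):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        if shuffled != sentences:
            return shuffled
    # Fallback: rotating by one always changes the order when the
    # sentences are not all identical.
    return sentences[1:] + sentences[:1]


def make_global_pairs(sentences: List[str],
                      rng: random.Random) -> List[Tuple[List[str], int]]:
    """Pair the original document (label 1) with a shuffled copy (label 0).

    The label convention is an assumption made for this sketch.
    """
    return [(list(sentences), 1), (shuffle_negative(sentences, rng), 0)]


if __name__ == "__main__":
    doc = [
        "The storm hit the coast on Monday.",
        "Thousands of homes lost power.",
        "Crews restored electricity by Wednesday.",
    ]
    for text, label in make_global_pairs(doc, random.Random(0)):
        print(label, " ".join(text))
```

Shuffling covers only the global aspect. The generative local augmentation described above requires a post-pretrained generator plus the two difficulty controls, and is not sketched here.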
