DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting

Xuanming Zhang, Anthony Diaz, Zixun Chen, Qingyang Wu, Kun Qian, Erik Voss, Zhou Yu

TL;DR

DECOR introduces a novel benchmark for improving coherence in L2 English writing by jointly addressing incoherence detection, reasoning about underlying causes, and rewriting the incoherent sentences. Built from TOEFL-11 sentences, DECOR provides 1,352 context-sentence pairs and 213 expert rewrites, with expert annotations covering semantic connection, entity references, discourse relations, consistency, and relevance. The work demonstrates that task-specific synthetic data enable smaller models (e.g., DeBERTa-base, Llama2-7B) to reach or approach GPT-4 performance on detection and rewriting tasks, and that reasoning guidance consistently improves rewrite quality. Through comprehensive annotation schemes, inter-annotator agreement, and both automatic and human evaluations, DECOR offers a first-of-its-kind resource to evaluate and improve coherence in L2 writing with practical implications for automated writing evaluation and feedback tools.

Abstract

Coherence in writing, an aspect that second-language (L2) English learners often struggle with, is crucial in assessing L2 English writing. Existing automated writing evaluation systems primarily use basic surface linguistic features to detect coherence in writing. However, little effort has been made to correct the detected incoherence, which could significantly benefit L2 language learners seeking to improve their writing. To bridge this gap, we introduce DECOR, a novel benchmark that includes expert annotations for detecting incoherence in L2 English writing, identifying the underlying reasons, and rewriting the incoherent sentences. To our knowledge, DECOR is the first coherence assessment dataset specifically designed for improving L2 English writing, featuring pairs of original incoherent sentences alongside their expert-rewritten counterparts. Additionally, we fine-tuned models to automatically detect and rewrite incoherence in student essays. We find that incorporating specific reasons for incoherence during fine-tuning consistently improves the quality of the rewrites, achieving a result that is favored in both automatic and human evaluations.

Paper Structure (62 sections, 5 figures, 20 tables)

Figures (5)

  • Figure 1: An overview of DECOR, which comprises three tasks: incoherence detection, reasoning, and rewriting. An example human rewrite is shown for the given context-sentence pair. The GPT-4 rewrite is unacceptable because it makes more invasive, unnecessary changes.
  • Figure 2: Distribution of specific reasons for incoherence, and those clustered into groups.
  • Figure 3: Human-expert-as-a-judge evaluation results, with GPT-4 rewrites as the baseline. We sample 100 examples and ask our human expert to judge each pairwise comparison. A higher win rate and a lower loss rate indicate superior quality.
  • Figure 4: The number of words per rewrite.
  • Figure 5: Distribution of essays by number of sentences and number of words.