EXCGEC: A Benchmark for Edit-Wise Explainable Chinese Grammatical Error Correction

Jingheng Ye, Shang Qin, Yinghui Li, Xuxin Cheng, Libo Qin, Hai-Tao Zheng, Ying Shen, Peng Xing, Zishan Xu, Guo Cheng, Wenhao Jiang

TL;DR

This work introduces EXGEC, a unified task that couples grammatical error correction with explainable reasoning for Chinese text, and presents EXCGEC, the first Chinese EXGEC benchmark with 8,216 explanation-augmented samples featuring hybrid edit-wise explanations. It develops multi-task baselines in post-explaining and pre-explaining forms, introduces the COTE decoding strategy to improve alignment between corrections and explanations, and verifies findings with both automatic metrics and human judgments. The results show that post-explaining models generally outperform pre-explaining ones and that pipeline approaches can outperform multi-task models, highlighting the challenges of joint learning for explainable GEC. This benchmark provides a foundation for evaluating explainability in GEC, guides future multi-task modeling for education-focused NLP, and offers a framework for robust automatic metrics and human evaluation of free-text explanations.

Abstract

Existing studies explore the explainability of Grammatical Error Correction (GEC) only in limited scenarios: they ignore the interaction between corrections and explanations and have not established a corresponding comprehensive benchmark. To bridge this gap, this paper first introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples with hybrid edit-wise explanations. We then benchmark several series of LLMs in multi-task learning settings, including post-explaining and pre-explaining. To promote the development of the task, we also build a comprehensive evaluation suite that leverages existing automatic metrics, and we conduct human evaluation experiments to demonstrate how consistent those metrics are with human judgments of free-text explanations. Our experiments show that traditional metrics such as METEOR and ROUGE are effective for evaluating free-text explanations, and that multi-task models underperform the pipeline solution, indicating that it remains challenging to obtain positive effects from learning both tasks jointly.
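To make the overlap-based evaluation of free-text explanations concrete, the following is a minimal sketch of a ROUGE-L F-score between a generated explanation and a reference explanation. It is an illustration only, not the paper's evaluation code. Character-level tokenization, the function names `lcs_length` and `rouge_l`, the `beta` parameter, and the two example strings are all assumptions made here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score over characters (assumed tokenization for Chinese text)."""
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    # Hypothetical explanation strings, invented for illustration.
    generated = "句子缺少主语"
    reference = "缺少主语"
    print(rouge_l(generated, reference))
```

A character-level variant avoids depending on a Chinese word segmenter. A word-level variant would only need a different tokenization step before calling `lcs_length`.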

Paper Structure (51 sections, 9 equations, 9 figures, 10 tables, 1 algorithm)

Figures (9)

  • Figure 1: Task definitions of GEC, GEE, and EXGEC. We highlight 【evidence words】, {correction}, linguistic knowledge, error causes, and revision advice parts.
  • Figure 2: Overview of the benchmark and the model. We show the inference process of a post-explaining model in particular.
  • Figure 3: Distribution of 7 kinds of LLM errors.
  • Figure 4: Examples of error types.
  • Figure 5: Examples of error types.
  • ...and 4 more figures