ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

Jeiyoon Park; Chanjun Park; Heuiseok Lim

TL;DR

The paper tackles the data bottleneck in grammatical error correction (GEC) by proposing an automated framework with four components (Subject Selector, Grammar Selector, Prompt Manager, Evaluator) that generates high-quality, diverse, human-like GEC data at scale. It also introduces ChatLang-8, a dataset of 1 million pairs spanning eight subject types and 23 grammar types, generated with GPT-3.5 Turbo and filtered by a four-criterion Evaluator. Models trained on ChatLang-8 achieve higher recall and F$_{0.5}$ on standard GEC benchmarks than models trained on Lang-8, and the dataset shows a more uniform error distribution and fewer trivial label issues. The work demonstrates that careful control over data generation and quality assessment can substantially improve GEC learning, and it offers the framework and dataset as practical resources for LLM-driven data generation workflows.
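For readers unfamiliar with the F$_{0.5}$ metric mentioned above, the following is a minimal sketch of the standard F-beta computation from edit-level counts. The function name and the zero-division handling are assumptions made for illustration; GEC scorers may treat empty cases differently, and this is not the paper's evaluation code.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive, and false-negative edit counts.

    With beta = 0.5, precision is weighted more heavily than recall.
    Returning 0.0 for empty denominators is a simplification (assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because beta is below 1, a system that raises recall at equal precision still gains F$_{0.5}$, but less than it would from an equal gain in precision.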

Abstract

We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.
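The four components named in the abstract can be pictured as a generate-then-filter loop. The sketch below is illustrative only: the subject and grammar inventories are placeholders (the paper's eight subject types and 23 grammar types are not listed here), the prompt wording is invented, the LLM is replaced by a stub, and the Evaluator checks are stand-ins rather than the paper's actual criteria.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Placeholder inventories (assumptions): not the paper's actual lists.
SUBJECTS = ["a teacher", "the weather", "my friends"]
GRAMMAR_TYPES = ["subject-verb agreement", "article usage", "verb tense"]


@dataclass
class Pair:
    source: str  # sentence containing a grammatical error
    target: str  # corrected sentence


def select_subject(rng: random.Random) -> str:
    """Subject Selector: pick a subject for the sentence."""
    return rng.choice(SUBJECTS)


def select_grammar(rng: random.Random) -> str:
    """Grammar Selector: pick the grammar type to violate."""
    return rng.choice(GRAMMAR_TYPES)


def build_prompt(subject: str, grammar: str) -> str:
    """Prompt Manager: combine the selections into one instruction (hypothetical wording)."""
    return (f"Write a sentence about {subject} that contains one "
            f"{grammar} error, followed by its correction.")


def evaluate(pair: Pair) -> bool:
    """Evaluator: stand-in checks (non-empty and non-identical), not the paper's criteria."""
    src, tgt = pair.source.strip(), pair.target.strip()
    return bool(src) and bool(tgt) and src != tgt


def generate(n: int, llm: Callable[[str], Pair], seed: int = 0) -> List[Pair]:
    """Request pairs from `llm` and keep those that pass the Evaluator."""
    rng = random.Random(seed)
    kept: List[Pair] = []
    attempts = 0
    while len(kept) < n and attempts < 10 * n:  # cap attempts so the loop terminates
        attempts += 1
        prompt = build_prompt(select_subject(rng), select_grammar(rng))
        pair = llm(prompt)
        if evaluate(pair):
            kept.append(pair)
    return kept


def fake_llm(prompt: str) -> Pair:
    """Stub standing in for a real LLM call."""
    return Pair("She go to school every day.", "She goes to school every day.")
```

Calling `generate(3, fake_llm)` returns three accepted pairs; a generator that only returns identical source and target sentences yields an empty list, because every candidate is rejected by the filter.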

Paper Structure (14 sections, 1 equation, 5 figures, 5 tables)

Introduction
ChatLang-8
Subject Selector
Grammar Selector
Prompt Manager
Evaluator
Dataset Details
Experiments
Statistical Analysis
Quantitative Results
Qualitative Results
Conclusion
Statistical Analysis: Error Type Distributions
Architecture

Figures (5)

Figure 1: The pipeline of Evaluator.
Figure 2: Comparison of M$^2$ outputs of GEC models, trained on Lang-8 and ChatLang-8 respectively.
Figure 3: Subject distribution of ChatLang-8.
Figure 4: Distributions of the ChatLang-8 and the other datasets with respect to 25 grammatical errors.
Figure 5: Schematic depiction of overall architecture
