LLMCL-GEC: Advancing Grammatical Error Correction with LLM-Driven Curriculum Learning

Tao Fang, Derek F. Wong, Lusheng Zhang, Keyan Jin, Qiang Zhang, Tianjiao Li, Jinlong Hou, Lidia S. Chao

TL;DR

This work introduces an LLM-based curriculum learning (CL) framework for grammatical error correction (GEC), leveraging LLaMA2-70b as an expert to score training data difficulty and forming easy, medium, and hard datasets. The methodology sequentially trains models (T5 and LLaMA series) from easy to hard while reintegrating earlier data to prevent forgetting, achieving significant improvements on CoNLL14 and BEA19 benchmarks over baselines and prior CL methods. Key findings show that easy-to-hard progression is essential for GEC, with LLM-based CL offering more robust gains than Len-based CL and showing particular strength in correcting diverse error types. The results suggest that data-difficulty-guided curriculum strategies based on powerful LLMs can substantially enhance domain-specific tasks and may be extensible to other NLP challenges.
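
The easy-to-hard schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_score` is a hypothetical stand-in for the LLaMA2-70b difficulty score, the thresholds `easy_max` and `medium_max` are invented for the example, and reading "reintegrating earlier data" as cumulative stages (each stage also contains all easier data) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Example:
    source: str  # sentence containing grammatical errors
    target: str  # corrected reference sentence


def toy_score(ex: Example) -> float:
    """Hypothetical difficulty score (NOT the paper's LLM score).

    Counts position-wise token mismatches plus the length difference,
    so sentences needing more edits receive a higher score.
    """
    src, tgt = ex.source.split(), ex.target.split()
    mismatches = sum(a != b for a, b in zip(src, tgt))
    return float(mismatches + abs(len(src) - len(tgt)))


def bucket_by_difficulty(
    examples: List[Example],
    score_fn: Callable[[Example], float],
    easy_max: float,    # assumed threshold, chosen for illustration only
    medium_max: float,  # assumed threshold, chosen for illustration only
) -> Dict[str, List[Example]]:
    """Split examples into easy / medium / hard buckets by score."""
    buckets: Dict[str, List[Example]] = {"easy": [], "medium": [], "hard": []}
    for ex in examples:
        score = score_fn(ex)
        if score <= easy_max:
            buckets["easy"].append(ex)
        elif score <= medium_max:
            buckets["medium"].append(ex)
        else:
            buckets["hard"].append(ex)
    return buckets


def build_stages(
    buckets: Dict[str, List[Example]],
) -> List[Tuple[str, List[Example]]]:
    """Return training stages ordered easy -> medium -> hard.

    Each stage also includes the data of all earlier stages, which is one
    (assumed) way to revisit earlier data and limit forgetting.
    """
    stages: List[Tuple[str, List[Example]]] = []
    seen: List[Example] = []
    for level in ("easy", "medium", "hard"):
        seen = seen + buckets[level]
        stages.append((level, list(seen)))
    return stages
```

In use, each stage's data would be passed in order to a fine-tuning routine (not shown here), with the model from one stage serving as the starting point for the next.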

Abstract

While large-scale language models (LLMs) have demonstrated remarkable capabilities in specific natural language processing (NLP) tasks, they may still lack proficiency compared to specialized models in certain domains, such as grammatical error correction (GEC). Drawing inspiration from the concept of curriculum learning, we have delved into refining LLMs into proficient GEC experts by devising effective curriculum learning (CL) strategies. In this paper, we introduce a novel approach, termed LLM-based curriculum learning, which capitalizes on the robust semantic comprehension and discriminative prowess inherent in LLMs to gauge the complexity of GEC training data. Unlike traditional curriculum learning techniques, our method closely mirrors human expert-designed curriculums. Leveraging the proposed LLM-based CL method, we sequentially select varying levels of curriculums ranging from easy to hard, and iteratively train and refine using the pretrained T5 and LLaMA series models. Through rigorous testing and analysis across diverse benchmark assessments in English GEC, including the CoNLL14 test, BEA19 test, and BEA19 development sets, our approach showcases a significant performance boost over baseline models and conventional curriculum learning methodologies.

Paper Structure

This paper contains 27 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Different curriculum learning methods exhibit varying degrees of consistency with the standards set by human experts in terms of establishing the difficulty levels of data. Our proposed LLM-based CL approach demonstrates a high level of alignment with these expert-defined criteria.
  • Figure 2: The overall framework of our proposed LLM-based CL method consists of two parts: one involves using a foundation LLM (LLaMA2-70b) to establish courses of different levels, while the other involves following the principles of curriculum learning to perform SFT on the GEC model using the established curriculums.
  • Figure 3: The F$_{0.5}$ scores reflect the performance on a selection of fine-grained error types within the CoNLL14 test set. The percentages provided in brackets represent the distribution of each error type. The findings indicate that our LLM-based CL method can significantly improve performance in correcting a majority of these fine-grained error types, outperforming both the Len-based method and other baseline approaches.
  • Figure 4: Progression of Recall, Precision, and F$_{0.5}$ from Easy to Hard stages on CoNLL14 test set using Len-based CL and LLM-based CL methods on T5-xl and LLaMA2-13b.
  • Figure 5: The scoring prompt designed for the LLaMA2-70b model, which is used to evaluate the difficulty of correcting the erroneous sentences provided to the model.
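
For reference, the F$_{0.5}$ metric named in Figures 3 and 4 is the standard F$_\beta$ score with $\beta = 0.5$, where $P$ is precision and $R$ is recall. The formula below is the general definition, not something specific to this paper:

$$
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
$$

With $\beta = 0.5$, precision is weighted more heavily than recall, which suits GEC, where a wrong correction is usually considered worse than a missed one.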