To Err Is Human, but Llamas Can Learn It Too

Agnes Luhtaru, Taido Purason, Martin Vainikko, Maksym Del, Mark Fishel

TL;DR

This study fine-tunes Llama 2-based LMs for error generation and finds that this approach yields synthetic errors akin to human errors, which helps GEC Llama models outperform previous state-of-the-art error correction models.

Abstract

This study explores enhancing grammatical error correction (GEC) through artificial error generation (AEG) using language models (LMs). Specifically, we fine-tune Llama 2-based LMs for error generation and find that this approach yields synthetic errors akin to human errors. Next, we train GEC Llama models with the help of these artificial errors and outperform previous state-of-the-art error correction models, with gains ranging between 0.8 and 6 F0.5 points across all tested languages (German, Ukrainian, and Estonian). Moreover, we demonstrate that generating errors by fine-tuning smaller sequence-to-sequence models and prompting large commercial LMs (GPT-3.5 and GPT-4) also results in synthetic errors beneficially affecting error generation models.

Paper Structure

This paper contains 24 sections, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: Quality of generated errors compared to gold and probabilistic errors, as shown by GEC results of tuning Llama-based models on same-sized synthetic or human (gold) error sets. GPT-3.5-turbo and GPT-4-turbo errors are generated via prompting; Llama stands for a Llama 2-based model fine-tuned on the AEG task.
  • Figure 2: Recall scores for the most frequent error categories in the Estonian EstGEC-L2 test set. The first letter corresponds to the operation type (R - replaced, M - missing, U - unnecessary).
  • Figure 3: Error type counts in Estonian, based on annotating 100 randomly selected sentences (R - replaced, M - missing, U - unnecessary).