RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts

Darya Kharlamova; Irina Proskurina

TL;DR

This work proposes a framework for generating L1-motivated errors that combines generative language models optimized with PPO, prompt-based control, and rule-based patterns. The resulting augmentation pipeline significantly improves detection performance, which makes it a potentially valuable tool for helping learners and teachers identify and address such errors.

Abstract

Many errors in student essays can be explained by the influence of the writer's native language (L1). One example of such L1 interference is writing "stadion" instead of "stadium", a lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences that combines expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline leads to a significant performance improvement, making it a potentially valuable tool for helping learners and teachers identify and address such errors more effectively.
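To make the rule-based part of the augmentation concrete, the following is a minimal sketch of how a transliteration error could be injected into a clean sentence together with token-level detection labels. It is an illustration under stated assumptions, not the authors' pipeline: the function name, the lexicon structure, and the labeling scheme are hypothetical, and only the stadium/stadion pair comes from the abstract.

```python
import re

# Hypothetical lexicon: correct English word -> Russian-influenced transliteration.
# Only the stadium/stadion pair is taken from the abstract; the structure is assumed.
TRANSLIT_ERRORS = {"stadium": "stadion"}


def inject_transliteration_errors(sentence, lexicon=TRANSLIT_ERRORS):
    """Return (corrupted_sentence, labels).

    labels holds one entry per whitespace-separated token:
    1 if the token was replaced by an interference error, 0 otherwise.
    """
    corrupted, labels = [], []
    for token in sentence.split():
        # Separate leading/trailing punctuation from the word itself.
        match = re.fullmatch(r"(\W*)(\w+)(\W*)", token)
        word = match.group(2) if match else None
        if word is not None and word.lower() in lexicon:
            replacement = lexicon[word.lower()]
            if word[0].isupper():
                # Preserve sentence-initial capitalization.
                replacement = replacement.capitalize()
            corrupted.append(match.group(1) + replacement + match.group(3))
            labels.append(1)
        else:
            corrupted.append(token)
            labels.append(0)
    return " ".join(corrupted), labels


if __name__ == "__main__":
    print(inject_transliteration_errors("We met near the stadium."))
```

A dictionary lookup like this covers only word-level substitutions. Error types that depend on context, such as tense semantics, would need the neural generation the abstract describes.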

Paper Structure

This paper contains 39 sections, 2 figures, 7 tables.

Introduction
Related Work
L1-Interference Error Detection
Data Augmentation for GEC and GED
Corpus of L1-Interference Errors
L1-Interference Annotation Scheme
Copying Expression
Synonyms
Tense Semantics
Transliteration
Word Form Transmission
Data Generation
PPO-based Generation
Model Set Up for PPO
Reward Models
...and 24 more sections

Figures (2)

Figure 1: Pairwise inter-annotator agreement (Cohen's kappa) for the REALEC-L1 data annotation.
Figure 2: Error distribution in RILEC, including synthetic (S) and real (R) data. CopExp = Copying Expression; WFT = Word Form Transmission; TenSem = Tense Semantics.
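
Figure 1 reports pairwise Cohen's kappa. As a reminder of what that statistic measures, here is a small sketch of the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The annotator names and label sequences are made up for illustration and do not reproduce any REALEC-L1 annotation.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical annotations; label names follow the error types in the outline.
    annotations = {
        "annotator_1": ["Transliteration", "Synonyms", "none", "none"],
        "annotator_2": ["Transliteration", "none", "none", "none"],
        "annotator_3": ["Transliteration", "Synonyms", "none", "Synonyms"],
    }
    for (name_a, a), (name_b, b) in combinations(annotations.items(), 2):
        print(name_a, name_b, round(cohens_kappa(a, b), 3))
```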
