Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases
Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau
TL;DR
This paper investigates whether Large Language Models (LLMs) can simulate second-language (L2) English dialogue that reflects biases driven by the speaker's first language (L1). It introduces an information-theoretic evaluation framework built on eight linguistic constructs and uses the ICNALE dataset to quantify how L1 prompting shifts LLM outputs toward human-like L2 patterns. The study shows that modern LLMs, especially GPT-4o, can replicate L1-dependent patterns across several languages, with performance varying by L1 and by language pair. L1 knowledge injection generally reduces the distributional distance to human data, though some constructs, such as Quantifiers/Numerals, remain challenging. The framework enables systematic analysis of L1 transfer in L2 dialogue generation and points toward educational applications such as L2 dialogue generation and evaluation, while acknowledging dataset and prompting limitations. Overall, the work maps both the potential and the limits of using LLMs to simulate human L2 dialogue for education and assessment.
Abstract
This study evaluates the ability of Large Language Models (LLMs) to simulate the non-native-like English produced by human second language (L2) learners, whose usage is subject to interference from their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners from seven L1 backgrounds (e.g., Japanese, Thai, Urdu) and compare their outputs with real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from different languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal the potential of LLMs for L2 dialogue generation and evaluation in future educational applications.
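The text above does not say which distributional measure the framework uses. As a rough illustration only, the sketch below assumes Jensen-Shannon divergence over category counts for a single linguistic construct, which is one common way to compare an LLM's output distribution with human learner data. The category names, the helper functions (`distribution`, `kl`, `js_divergence`), and the toy labels are all hypothetical; none of them come from the paper or from ICNALE.

```python
import math
from collections import Counter


def distribution(labels, categories, alpha=1.0):
    """Convert a list of labels into an add-alpha smoothed probability vector."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(categories)
    return [(counts[c] + alpha) / total for c in categories]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical annotation categories for one construct (assumption, not from the paper).
categories = ["correct", "tense_error", "omitted"]

# Invented toy labels purely for illustration; these are not real learner or model data.
human = ["correct"] * 6 + ["tense_error"] * 3 + ["omitted"] * 1
llm_with_l1_prompt = ["correct"] * 7 + ["tense_error"] * 2 + ["omitted"] * 1
llm_without_l1_prompt = ["correct"] * 10

p_human = distribution(human, categories)
d_with = js_divergence(p_human, distribution(llm_with_l1_prompt, categories))
d_without = js_divergence(p_human, distribution(llm_without_l1_prompt, categories))

print(f"JS distance to human data with L1 prompt:    {d_with:.4f}")
print(f"JS distance to human data without L1 prompt: {d_without:.4f}")
```

A lower value means the model's distribution over categories is closer to the human one. Under this reading, the claim that L1 knowledge injection reduces distributional distance would correspond to the first number being smaller than the second, computed per construct and per L1.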
