Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases

Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau

TL;DR

This paper investigates whether Large Language Models can simulate L2-English dialogues with biases driven by L1 backgrounds. It introduces an information-theoretic evaluation framework and eight linguistic constructs, using the ICNALE dataset to quantify how L1 prompting shapes LLM outputs toward human-like L2 patterns. The study shows that modern LLMs, especially GPT-4o, can replicate L1-dependent patterns across several languages, with model performance varying by L1 and by language pair; L1 knowledge injection generally reduces distributional distance to human data, though some constructs like Quantifiers/Numerals remain challenging. The proposed framework enables systematic analysis of L1 transfer in L2 dialogue generation, offering a pathway to educational applications such as L2 dialogue generation and evaluation, while acknowledging dataset and prompting limitations. Overall, the work demonstrates the potential and boundaries of using LLMs to simulate human L2 dialogue for education and assessment.

Abstract

This study evaluates Large Language Models' (LLMs) ability to simulate non-native-like English use observed in human second language (L2) learners interfered with by their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners with specific L1s (e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from various languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal the potential of LLMs for L2 dialogue generation and evaluation for future educational applications.
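To make "distributional distance to human data" concrete, here is a minimal sketch. It assumes each utterance has already been labelled with a category for one linguistic construct (for example, tense agreement). This summary does not state the paper's actual measure, so Jensen-Shannon divergence is used here only as a stand-in. The category names, the toy label lists, and the helper functions are all hypothetical and are not results from the study.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Relative frequency of each category, with add-one smoothing."""
    counts = Counter(labels)
    total = len(labels) + len(categories)
    return [(counts[c] + 1) / total for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric, between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical categories for a single construct (assumption, not from the paper).
CATEGORIES = ["correct", "tense_error", "omitted"]

# Invented toy labels, for illustration only.
human = ["correct"] * 6 + ["tense_error"] * 3 + ["omitted"] * 1
llm_with_l1 = ["correct"] * 7 + ["tense_error"] * 2 + ["omitted"] * 1
llm_without_l1 = ["correct"] * 10

human_p = distribution(human, CATEGORIES)
with_l1_p = distribution(llm_with_l1, CATEGORIES)
without_l1_p = distribution(llm_without_l1, CATEGORIES)

dist_with_l1 = js_divergence(human_p, with_l1_p)
dist_without_l1 = js_divergence(human_p, without_l1_p)

print(f"JS(human, LLM with L1 prompt):    {dist_with_l1:.4f}")
print(f"JS(human, LLM without L1 prompt): {dist_without_l1:.4f}")
```

In this toy setup, a smaller divergence for the L1-prompted output than for the unprompted output would correspond to the pattern the abstract describes, where L1 prompting moves LLM output closer to human L2 data.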

Paper Structure

This paper contains 39 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Examples of L2 English dialogue from human speakers, which can generally be biased by their native L1 knowledge, e.g., with particular errors.
  • Figure 2: An example of Thai L1 knowledge injection prompting for Speech Acts. The prompt provides full sentences in a complete dialogue context; the utterances are omitted as "..." in this figure.
  • Figure 3: Density results for GPT-4o-generated L2 dialogue across different L1s, where NVC stands for noun and verb collocations, TA for tense agreement, and NA for number agreement. The blue lines (L2-Generated), orange lines (English-Generated), and green lines (L2-Humans) correspond to LLM-generated dialogue with L1 knowledge injection prompting, LLM-generated dialogue without it, and the respective human dialogue.
  • Figure 4: Density results of human-baseline dialogues of different L1s, where NVC represents Noun and Verb Collocations, TA for Tense Agreement and NA for Number Agreement.
  • Figure 5: Full density results for L2 generation dialogue via Korean L1s
  • ...and 6 more figures