Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch
TL;DR
This work tackles the challenge of making LLM-based language practice accessible to absolute beginners by introducing modular, inference-time difficulty control. It evaluates baseline prompting, Overgenerate, and FUDGE (Future Discriminators for Generation) on Japanese JLPT-level dialogues through both automatic metrics and a human study, showing that FUDGE substantially improves learner comprehensibility while preserving fluency (raising comprehensibility for beginners from 39.4% to 83.3%). It also proposes a new metric, Token Miss Rate (TMR), which quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments, enabling scalable evaluation. The results suggest that difficulty-controlled LLMs can serve as practical, on-device-friendly language partners for beginners, with implications for scalable, inclusive language learning tools. Limitations include language specificity, the subjectivity of the difficulty definition, and TMR's reliance on exact token binning; future work may broaden to other languages and adopt semantic-aware evaluation. The authors release code, models, annotation tools, and data to foster further AI-assisted language learning research.
Abstract
Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for first- and second-year beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques can adapt LLM outputs to better support beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails, controllable generation techniques can successfully improve output comprehensibility for beginner speakers (from 39.4% to 83.3%). We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
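To make the TMR definition above concrete, the following is a minimal sketch of a per-utterance miss rate, not the authors' implementation. It assumes the utterance has already been split into tokens (for Japanese this would require a morphological analyzer, which is not shown) and that "incomprehensible" means "absent from a known-vocabulary set for the learner's level". The function name `token_miss_rate` and the toy vocabulary are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Set


def token_miss_rate(tokens: Iterable[str], known_vocab: Set[str]) -> float:
    """Return the fraction of tokens in one utterance that are not in known_vocab.

    Assumption: a token counts as incomprehensible when it is missing from the
    learner's vocabulary set. The paper's exact matching rules may differ.
    """
    token_list = list(tokens)
    if not token_list:
        # An empty utterance has no tokens to miss; report 0.0 by convention.
        return 0.0
    missed = sum(1 for token in token_list if token not in known_vocab)
    return missed / len(token_list)


# Toy example: a pre-tokenized utterance and a hypothetical beginner vocabulary.
utterance = ["私", "は", "学生", "です"]
beginner_vocab = {"私", "は", "です"}
print(token_miss_rate(utterance, beginner_vocab))  # 0.25 (one of four tokens is unknown)
```

Because the check is exact set membership, a token that differs from a known word only in inflection or script would count as a miss under this sketch, which mirrors the token-matching limitation noted in the TL;DR.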
