Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation

Meiqing Jin, Liam Dugan, Chris Callison-Burch

TL;DR

This work tackles the challenge of making LLM-based language practice accessible to beginners by introducing modular, inference-time difficulty control. It evaluates Baseline prompting, Overgenerate, and FUDGE (Future Discriminators for Generation) on Japanese JLPT-level dialogues through both automatic metrics and a human study, showing that FUDGE substantially improves learner comprehensibility while preserving fluency (raising the share of output that beginners understand from 39.4% to 83.3%). A new metric, Token Miss Rate (TMR), is proposed to quantify the proportion of incomprehensible tokens per utterance; it correlates strongly with human judgments, enabling scalable evaluation. The results suggest that difficulty-controlled LLMs can serve as practical, on-device-friendly language partners for beginners, with implications for scalable, inclusive language learning tools. Limitations include language specificity, the subjectivity of the difficulty definition, and TMR's reliance on exact token binning; future work may broaden to other languages and adopt semantic-aware evaluation. The authors release code, models, annotation tools, and data to foster further AI-assisted language learning research.

Abstract

Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for first- and second-year beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques can adapt LLM outputs to better support beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails, controllable generation techniques can successfully improve output comprehensibility for beginner speakers (from 39.4% to 83.3%). We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
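To make the TMR definition concrete, the following is a minimal sketch of how a per-utterance miss rate could be computed, assuming the utterance is already tokenized and that a lexicon maps each token to a JLPT level. The function name `token_miss_rate`, the `JLPT_ORDER` table, the toy lexicon, and the choice to count out-of-lexicon tokens as misses are all illustrative assumptions, not the authors' released implementation.

```python
from typing import Mapping, Sequence

# JLPT levels run from N5 (easiest) to N1 (hardest).
JLPT_ORDER = {"N5": 0, "N4": 1, "N3": 2, "N2": 3, "N1": 4}


def token_miss_rate(
    tokens: Sequence[str],
    token_levels: Mapping[str, str],
    target_level: str,
) -> float:
    """Return the fraction of tokens whose level is above target_level.

    Assumption (not from the paper): tokens missing from the lexicon
    are counted as misses, since a beginner is unlikely to know them.
    """
    if not tokens:
        return 0.0
    target = JLPT_ORDER[target_level]
    missed = 0
    for token in tokens:
        level = token_levels.get(token)
        if level is None or JLPT_ORDER[level] > target:
            missed += 1
    return missed / len(tokens)


# Toy lexicon and utterance, hypothetical and for illustration only.
lexicon = {"私": "N5", "は": "N5", "学生": "N5", "です": "N5", "憂鬱": "N1"}
utterance = ["私", "は", "憂鬱", "です"]

# One of four tokens ("憂鬱", N1) is above the N5 target.
print(token_miss_rate(utterance, lexicon, "N5"))
```

Because this sketch relies on exact lookup of each token in a level lexicon, it shares the limitation noted in the TL;DR: a token that falls outside the lexicon or is segmented differently is scored as a miss regardless of whether a learner would actually understand it.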

Paper Structure

This paper contains 64 sections, 10 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: We control the difficulty of language‑model outputs using modular difficulty control techniques. This outperforms prompting in both automatic and human evaluations.
  • Figure 2: In the "self-chat" evaluation pipeline, we evaluate our controlled generation methods by simulating conversations between a student LLM and a difficulty-controlled tutor LLM. Tutor outputs are evaluated using Token Miss Rate (TMR), which quantifies the percentage of tokens in an utterance above the target level.
  • Figure 3: After each turn of conversation, participants were asked to highlight on an iPad specific words or phrases they did not understand. We used these annotations to manually compute Token Miss Rate (TMR).
  • Figure 4: Distribution of JLPT/CEFR level of our study participants along with their self-reported average number of hours spoken per week.
  • Figure 5: The voice-based interface used for the human evaluation. Users clicked the microphone icon when they wanted to speak and clicked again when finished.
  • ...and 8 more figures