Grammar Control in Dialogue Response Generation for Language Learning Chatbots

Dominik Glandorf, Peng Cui, Detmar Meurers, Mrinmaya Sachan

TL;DR

This work tackles grounding dialogue generation in pedagogy by controlling grammar forms through the English Grammar Profile (EGP) within CEFR-guided categories. It evaluates prompting, fine-tuning, and decoding strategies, finding that guided decoding with Llama3 provides the best trade-off between grammar constraint satisfaction and response quality, achieving ~59.3% form inclusion while maintaining grammatical correctness comparable to GPT-3.5. A learner-proficiency simulation suggests that grammar-controlled input can boost learners' production of target forms, supporting adaptive practice across CEFR levels. The study highlights practical gains for language-learning chatbots and outlines avenues for validation with real teachers and for expanded grammar inventories.

Abstract

Chatbots based on large language models offer cheap conversation practice opportunities for language learners. However, they are hard to control for linguistic forms that correspond to learners' current needs, such as grammar. We control grammar in chatbot conversation practice by grounding a dialogue response generation model in a pedagogical repository of grammar skills. We also explore how this control helps learners to produce specific grammar. We comprehensively evaluate prompting, fine-tuning, and decoding strategies for grammar-controlled dialogue response generation. Strategically decoding Llama3 outperforms GPT-3.5 when tolerating minor response quality losses. Our simulation predicts grammar-controlled responses to support grammar acquisition adapted to learner proficiency. Existing language learning chatbots and research on second language acquisition benefit from these affordances. Code available on GitHub.

Paper Structure

This paper contains 34 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Example conversation of the pedagogically motivated chatbot. Based on the current learner level, the chatbot includes grammar skills in its response, and the learner replicates the advanced grammar structure in their response.
  • Figure 2: Precision distribution on the same (validation) and an unseen (test) corpus of 53 grammar skill detectors trained on three different datasets.
  • Figure 3: Strategy performance comparison for generating responses that include grammar skills from a category and on a specified level (N=1500, single run).
  • Figure 4: Grammar skill tuples with a significantly increased target skill are indicated in red for simulated learners on the specified proficiency level (x-axis). Grey boxes reflect pairs not expected to be significant because the target grammar skills are too difficult. Diagram with y-axis labels and odds ratios in the learner simulation appendix.
  • Figure 5: Example dialogue of DailySum.
  • ...and 6 more figures