Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots
Junling Wang, Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Mrinmaya Sachan
TL;DR
This paper tackles the scarcity of high-quality data for educational chatbots by proposing Book2Dial, a framework that generates synthetic teacher-student dialogues grounded in open textbooks. It formalizes a versatile pipeline with three instantiations (multi-turn QG-QA, dialogue inpainting, and persona-based generation) and a quality rubric covering relevance, coherence, informativeness, grounding, answerability, factual consistency, and specificity. Automatic metrics and human evaluation across multiple textbook domains show that role-playing LLMs generally yield the strongest overall dialogue quality, though they hallucinate and repeat information, while grounding-based methods excel in informativeness and groundedness. Pre-training educational chatbots on this textbook-derived synthetic data can improve downstream performance when the textbook domain aligns with the target task, which suggests a practical path to cost-effective chatbot development. The work also discusses limitations and ethical considerations, emphasizing the need to balance data size against quality and to avoid overreliance on synthetic data in real classrooms.
Abstract
Educational chatbots are a promising tool for assisting student learning. However, the development of effective chatbots in education has been challenging, as high-quality data is seldom available in this domain. In this paper, we propose a framework for generating synthetic teacher-student interactions grounded in a set of textbooks. Our approaches capture one aspect of learning interactions where curious students with partial knowledge interactively ask a teacher questions about the material in the textbook. We highlight various quality criteria that such dialogues should fulfill and compare several approaches relying on either prompting or fine-tuning large language models. We use synthetic dialogues to train educational chatbots and show benefits of further fine-tuning in different educational domains. However, human evaluation shows that our best data synthesis method still suffers from hallucinations and tends to reiterate information from previous conversations. Our findings offer insights for future efforts in synthesizing conversational data that strikes a balance between size and quality. We will open-source our data and code.
