Conversations as a Source for Teaching Scientific Concepts at Different Education Levels

Donya Rooein, Dirk Hovy

TL;DR

The paper tackles the problem of training conversational teaching agents to address diverse education levels by introducing the 5-Levels dataset, derived from WIRED video transcripts that capture instructor–learner dialogues for audiences ranging from child to expert. It provides a metadata-enriched resource (125 level-specific conversations, ~570 minutes, ~102k words, ~2.9k turns involving ~150 interlocutors) and analyzes these data using Conversation Analysis and readability metrics to reveal how instruction adapts to audience expertise. Findings show a clear shift from instructor-dominated talk at lower levels to more balanced or learner-dominated dialogue at higher levels, with presentation styles and analogies evolving accordingly. The work offers a practical asset for training and evaluating adaptive educational language models and emphasizes ethical data usage and public availability to support future research in automated, level-appropriate science communication.

Abstract

Open conversations are one of the most engaging forms of teaching. However, creating those conversations in educational software is a complex endeavor, especially if we want to address the needs of different audiences. While language models hold great promise for educational applications, training them to engage in meaningful and effective conversational teaching remains a substantial challenge, and no official data sets exist to facilitate such training. This paper presents a novel source for facilitating conversational teaching of scientific concepts at various difficulty levels (from preschooler to expert), namely dialogues taken from video transcripts. We analyse this data source in various ways to show that it offers a diverse array of examples that can be used to generate contextually appropriate and natural responses to scientific topics for specific target audiences. It is a freely available, valuable resource for training and evaluating conversation models, encompassing organically occurring dialogues. While the raw data is available online, we provide additional metadata for conversational analysis of dialogues at each level in all available videos.

Paper Structure (7 sections, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Examples of conversations of an instructor with a child and expert learners.