"How Do I ...?": Procedural Questions Predominate Student-LLM Chatbot Conversations

Alexandra Neagu, Marcus Messer, Peter Johnson, Rhodri Nelson

TL;DR

This paper investigates how student questions drive scaffolding in LLM-powered educational chatbots across two learning contexts (formative self-study and summative coursework). It evaluates whether four established question-classification schemas can be applied at scale, using 11 LLM-raters and 3 human raters on 6,113 messages, with a two-stage filtering process to identify questions. Results show moderate-to-good agreement among LLM-raters, with higher reliability for larger models. Procedural questions predominate in both contexts, especially during summative tasks. However, current schemas fail to capture the semantic richness of composite prompts. The authors argue for task-specific schemas and the integration of conversation-analysis methods to better characterize student–LLM interactions and enable scalable, nuanced classification.

Abstract

Providing scaffolding through educational chatbots built on Large Language Models (LLMs) has potential risks and benefits that remain an open area of research. When students navigate impasses, they ask for help by formulating impasse-driven questions. Within interactions with LLM chatbots, such questions shape the user prompts and drive the pedagogical effectiveness of the chatbot's response. This paper focuses on such student questions from two datasets of distinct learning contexts: formative self-study, and summative assessed coursework. We analysed 6,113 messages from both learning contexts, using 11 different LLMs and three human raters to classify student questions according to four existing schemas. On the feasibility of using LLMs as raters, results showed moderate-to-good inter-rater reliability, with higher consistency than human raters. The data showed that 'procedural' questions predominated in both learning contexts, but more so when students prepare for summative assessment. These results provide a basis on which to use LLMs for classification of student questions. However, we identify clear limitations in both the ability to classify with schemas and the value of doing so: schemas are limited and thus struggle to accommodate the semantic richness of composite prompts, offering only a partial understanding of the wider risks and benefits of chatbot integration. In the future, we recommend an analysis approach that captures the nuanced, multi-turn nature of conversation, for example, by applying methods from conversation analysis in discursive psychology.
Paper Structure (19 sections, 2 figures, 4 tables)

This paper contains 19 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Leave-One-Out analysis of inter-rater reliability across 11 LLM-raters for FormativeChat (left) and SummativeChat (right). Bars represent the change in inter-rater agreement coefficients when a specific rater is iteratively removed. An increase identifies the removed rater as a source of disagreement, while a decrease indicates the rater contributed positively to the agreement.
  • Figure 2: Distribution of student question types for FormativeChat (light) and SummativeChat (dark), calculated as the mean across 11 LLM-raters for four schemas. Error bars represent standard deviation across models. Results illustrate the prevalence of procedural questions across both learning contexts.
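
The leave-one-out procedure described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes Fleiss' kappa as the agreement coefficient, which is an assumption made here for illustration; the summary above does not say which coefficients the authors used. The function names, rater names, toy ratings, and the "conceptual" label are all hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one list of labels per item, one label per rater.

    Assumption: every item is labelled by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # For each item, count how many raters chose each category.
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        return 1.0  # every rating is the same category
    return (p_bar - p_e) / (1 - p_e)


def leave_one_out_deltas(ratings, rater_names):
    """Change in kappa when each rater is removed in turn.

    A positive value means agreement rose without that rater,
    i.e. the rater was a source of disagreement.
    """
    baseline = fleiss_kappa(ratings)
    deltas = {}
    for idx, name in enumerate(rater_names):
        reduced = [item[:idx] + item[idx + 1:] for item in ratings]
        deltas[name] = fleiss_kappa(reduced) - baseline
    return deltas


# Hypothetical toy data: four messages, three raters (A, B, C).
toy_ratings = [
    ["procedural", "procedural", "conceptual"],
    ["conceptual", "conceptual", "procedural"],
    ["procedural", "procedural", "procedural"],
    ["conceptual", "conceptual", "conceptual"],
]
deltas = leave_one_out_deltas(toy_ratings, ["A", "B", "C"])
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

In this toy data raters A and B always agree, so removing C raises the coefficient while removing A or B lowers it, which mirrors how the bars in Figure 1 are read.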