Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

Megan Gu; Chloe Qianhui Zhao; Claire Liu; Nikhil Patel; Jahnvi Shah; Jionghao Lin; Kenneth R. Koedinger

TL;DR

This paper investigates automated qualitative analysis of tutoring dialogues, using GPT-3.5 with few-shot chain-of-thought prompting to classify and assess five tutoring strategies in transcripts. It builds a Tutor Dialogue Classification system and evaluates it on labeled examples from the Teacher-Student Chatroom Corpus. The reported True Negative Rates (0.655 to 0.738) show moderate strength in excluding incorrect labels, while the Recall values (0.327 to 0.432) show limited ability to consistently identify the correct strategy. The findings highlight the potential of LLMs for tutoring-dialogue analysis and point to opportunities for improvement, such as adopting more advanced models and richer feedback mechanisms. The work lays groundwork for scalable, automated feedback on tutoring practices and guides future enhancements in model capability and evaluation depth.

Abstract

Our study introduces an automated system that uses large language models (LLMs) to assess the effectiveness of five key tutoring strategies: (1) giving effective praise, (2) reacting to errors, (3) determining what students know, (4) helping students manage inequity, and (5) responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as being employed in either a desired or an undesired way. We use GPT-3.5 with few-shot prompting to analyze the tutoring dialogues and assess the use of these strategies. Across the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738 and Recall ranges from 0.327 to 0.432, indicating that the model is better at excluding incorrect classifications than at consistently identifying the correct strategy. The strategy "helping students manage inequity" showed the highest performance, with a TNR of 0.738 and a Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback.
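To make the two reported metrics concrete, the following is a minimal sketch of how Recall and TNR are computed from binary labels. It is not the authors' evaluation code: the function name `recall_and_tnr`, the choice of which class counts as positive, and the toy labels are all assumptions made for illustration.

```python
def recall_and_tnr(y_true, y_pred):
    """Return (recall, tnr) for binary labels, where 1 marks the positive class.

    Assumption: 1 means "this strategy label applies" and 0 means it does not.
    The abstract does not spell out the encoding, so this mapping is hypothetical.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Recall = TP / (TP + FN): share of actual positives the model finds.
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    # TNR = TN / (TN + FP): share of actual negatives the model correctly excludes.
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return recall, tnr


# Toy labels for illustration only; these are not data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]
recall, tnr = recall_and_tnr(y_true, y_pred)
print(f"Recall = {recall:.3f}, TNR = {tnr:.3f}")
```

A model can score a high TNR while its Recall stays low if it rarely predicts the positive class, which matches the pattern the abstract describes.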
