Do Tutors Learn from Equity Training and Can Generative AI Assess It?

Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger

TL;DR

This paper investigates whether a scenario-based equity-training lesson improves tutors' equity-responsive tutoring and whether large language models (LLMs) can reliably assess open-ended tutor responses. Using a mixed-method design with 81 undergraduate tutors, the study finds marginally significant learning gains, along with increases in tutors' self-reported confidence in their knowledge. Both GPT-4o and GPT-4-turbo prove capable of assessing tutor responses; GPT-4o with few-shot prompting offers the best trade-off among performance, efficiency, and cost for scalable evaluation. A dataset of lesson logs, human annotations, and AI prompts is released to support transparency and replication in learning analytics. Limitations include sample size, post-survey nonresponse, and some AI-human scoring misalignments, pointing future work toward more balanced scenarios, transfer testing, and robust truth sources.

Abstract

Equity is a core concern of learning analytics. However, applications that teach and assess equity skills, particularly at scale, are lacking, often due to barriers in evaluating language. Advances in generative AI via large language models (LLMs) are being used in a wide range of applications, and the present work assesses their use in the equity domain. We evaluate tutor performance within an online lesson on enhancing tutors' skills when responding to students in potentially inequitable situations. We apply a mixed-method approach to analyze the performance of 81 undergraduate remote tutors. We find marginally significant learning gains, with increases from pretest to posttest in tutors' self-reported confidence in their knowledge of how to respond to middle school students experiencing possible inequities. Both GPT-4o and GPT-4-turbo demonstrate proficiency in assessing tutors' ability to predict and explain the best approach. Balancing performance, efficiency, and cost, we determine that few-shot learning using GPT-4o is the preferred model. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts. Future work involves leveling the difficulty among scenarios and enhancing LLM prompts for large-scale grading and assessment.

Paper Structure

This paper contains 26 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The modified predict-observe-explain cycle for the pretest and posttest scenarios.
  • Figure 2: The scenario involving student Jeremiah with the open-ended question prompting a tutor to predict the best approach.
  • Figure 3: The scenario involving student Alexis with the open-ended question prompting a tutor to predict the best approach.
  • Figure 4: Mean pretest and posttest scores between scenario order conditions and measurement points.
  • Figure 5: Average open response scores (2 pts total) at pretest and posttest by scenario for each LLM model and human graders.
  • ...and 1 more figure