AITutor-EvalKit: Exploring the Capabilities of AI Tutors

Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar

TL;DR

AITutor-EvalKit is an open-source toolkit for automatically and interactively evaluating the pedagogical quality of AI tutors along a four-dimension SMR taxonomy: Mistake Identification (MI), Mistake Location (ML), Providing Guidance (PG), and Actionability (AC). The backend combines a lightweight LoRA-based multi-task learner (LoMTL) with LLM-as-judge evaluations, while the frontend offers automated evaluation, LLM judging, and visual analytics built on MRBench-derived data. In intrinsic evaluation, LoMTL achieves strong accuracy and macro-F1, competitive with GPT-5 and superior to Prometheus2, and a human study corroborates the tool's usability and perceived accuracy. The work provides a scalable, configurable platform for evaluation and demonstration in AI-in-Education, with clear paths for extension to new domains, languages, and taxonomies.
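The two intrinsic metrics named above, accuracy and macro-F1, can be illustrated with a minimal sketch for a single evaluation dimension. The label set ("Yes", "To some extent", "No") is an assumption based on the "TSE" label shown in Figure 1, the function names are hypothetical, and the toy annotations are invented purely for illustration; this is not the toolkit's own evaluation code.

```python
# Minimal sketch of accuracy and macro-F1 for one evaluation dimension.
# Assumption: each tutor response gets one of three labels per dimension.
LABELS = ("Yes", "To some extent", "No")


def accuracy(gold, pred):
    """Fraction of responses where the predicted label matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 scores, so rare labels count equally."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy annotations (illustrative only, not data from the paper).
gold = ["Yes", "Yes", "No", "To some extent"]
pred = ["Yes", "No", "No", "To some extent"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Macro-averaging weights every label equally regardless of frequency, which is why it is usually reported next to accuracy when label distributions are skewed.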

Abstract

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as for model inspection and data visualization. The tool is aimed at education stakeholders as well as the *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.

Paper Structure

This paper contains 35 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: A sample dialogue and its pedagogical-ability evaluation by the LoMTL model using the AITutor-EvalKit. The evaluation follows the four dimensions from Kochmar et al. (2025): MI (Mistake Identification), ML (Mistake Location), PG (Providing Guidance), and AC (Actionability). TSE: To some extent.
  • Figure 2: AITutor-EvalKit pipeline: The backend module includes several model options to assess the pedagogical soundness of tutors' responses, and the frontend projects evaluation outputs via an interactive user interface.
  • Figure 3: Participants' responses indicating how often they agreed with the models' judgments in single-response and comparison modes.
  • Figure 4: Overview of the prompt components, their associated definitions and details, and the final prompt structure used in LoMTL.
  • Figure 5: Informed consent form that participants were required to accept before proceeding with their feedback and annotations.
  • ...and 10 more figures