AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar
TL;DR
AITutor-EvalKit is an open-source toolkit for automatically and interactively evaluating the pedagogical quality of AI tutors using a four-dimension SMR taxonomy. The backend combines a lightweight LoRA-based multi-task learner (LoMTL) with LLM-as-judge evaluations, while the frontend offers automated evaluation, LLM judging, and visual analytics over MRBench-derived data. In intrinsic evaluation, LoMTL achieves strong accuracy and macro-F1, competitive with GPT-5 and superior to Prometheus2, and a human study corroborates the toolkit's usability and perceived accuracy. The work provides a scalable, configurable platform for evaluation and demonstration in AI-in-Education research, with clear paths for extension to new domains, languages, and taxonomies.
Abstract
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as for model inspection and data visualization. The tool is aimed at education stakeholders as well as the *ACL community at large: it supports learning and can also be used to collect user feedback and annotations.
