Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants

Shi Ding; Brian Magerko

TL;DR

The paper addresses a gap in how AI is evaluated in education, arguing that traditional metrics neglect learner agency, context, and ethics. It introduces TEACH-AI, a domain-independent framework and benchmark comprising ten human-centered components, plus a practical reflective toolkit for evaluating and guiding the design of generative AI tutors in education. Grounded in a scoping review spanning the pre-LLM, transformer, and GenAI eras, the framework emphasizes explainability, adaptivity, equity, and stakeholder collaboration, and offers early design implications for future benchmarks. The work aims to enable co-design and scalable, responsible AI evaluation ecosystems that promote long-term educational impact and inclusive adoption in diverse learning environments.

Abstract

As generative artificial intelligence (AI) continues to transform education, most existing AI evaluations rely primarily on technical performance metrics such as accuracy or task efficiency while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. In this paper, we present TEACH-AI (Trustworthy and Effective AI Classroom Heuristics), a domain-independent, pedagogically grounded, and stakeholder-aligned framework with measurable indicators and a practical toolkit for guiding the design, development, and evaluation of generative AI systems in educational contexts. Built on an extensive literature review and synthesis, the ten-component assessment framework and toolkit checklist provide a foundation for scalable, value-aligned AI evaluation in education. TEACH-AI rethinks "evaluation" through sociotechnical, educational, theoretical, and applied lenses, engaging designers, developers, researchers, and policymakers across AI and education. Our work invites the community to reconsider what constitutes "effective" AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.
