Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

Irina Jurenka; Markus Kunesch; Kevin R. McKee; Daniel Gillick; Shaojian Zhu; Sara Wiltberger; Shubham Milind Phal; Katherine Hermann; Daniel Kasenberg; Avishkar Bhoopchand; Ankit Anand; Miruna Pîslar; Stephanie Chan; Lisa Wang; Jennifer She; Parsa Mahmoudieh; Aliya Rysbek; Wei-Jen Ko; Andrea Huber; Brett Wiltshire; Gal Elidan; Roni Rabin; Jasmin Rubinovitz; Amit Pitaru; Mac McAllister; Julia Wilkowski; David Choi; Roee Engelberg; Lidan Hackmon; Adva Levin; Rachel Griffin; Michael Sears; Filip Bar; Mia Mesar; Mana Jabbour; Arslan Chaudhry; James Cohan; Sridhar Thiagarajan; Nir Levine; Ben Brown; Dilan Gorur; Svetlana Grant; Rachel Hashimshoni; Laura Weidinger; Jieru Hu; Dawn Chen; Kuba Dolecki; Canfer Akbulut; Maxwell Bileschi; Laura Culp; Wen-Xin Dong; Nahema Marchal; Kelsie Van Deman; Hema Bajaj Misra; Michael Duah; Moran Ambar; Avi Caciularu; Sandra Lefdal; Chris Summerfield; James An; Pierre-Alexandre Kamienny; Abhinit Mohdi; Theofilos Strinopoulous; Annie Hale; Wayne Anderson; Luis C. Cobo; Niv Efron; Muktha Ananda; Shakir Mohamed; Maureen Heymans; Zoubin Ghahramani; Yossi Matias; Ben Gomes; Lila Ibrahim

TL;DR

The paper addresses the challenge of responsibly deploying generative AI in education by developing LearnLM-Tutor, a pedagogy-focused tutor built through an evaluation-driven, participatory process. It introduces seven diverse educational benchmarks and a comprehensive set of fine-tuning datasets to embed learning-science principles into Gemini-based tutoring, culminating in LearnLM-Tutor (M4). Extensive human and automatic evaluations show LearnLM-Tutor generally outperforms a prompt-tuned Gemini baseline on multiple pedagogical axes and in real-world deployment (ASU Study Hall), while also detailing safety, policy, and mitigation frameworks. The work emphasizes rigorous, multi-faceted evaluation as essential for scalable, equitable EdTech, and invites the community to adopt and extend its benchmarks and governance practices to maximize the positive impact of gen AI in education.

Abstract

A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high-level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations; and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt-tuned Gemini by educators and learners on a number of pedagogical dimensions. We hope that this work can serve as a first step towards developing a comprehensive educational evaluation framework, and that this can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.

Paper Structure (111 sections, 21 figures, 16 tables)

Introduction
Participatory approach
Participatory workshops: Imagining and critiquing the future of education and AI
Understanding learning experiences: Initial interviews and Wizard-of-Oz sessions
Lessons from ShiffBot: Co-design activities
Improving Gemini for education
Lessons from learning science
Lessons from EdTech
Generative AI in education
Prompting
Fine-tuning
Our SFT datasets
Human tutoring
Gen AI role-play
GSM8k dialogue
...and 96 more sections

Figures (21)

Figure 1: LearnLM-Tutor Development: overview of our approach to responsible development of gen AI for education. Bold arrows show the development flow, dotted arrows the information flow. Our approach starts and ends with participation. We start by answering the questions of "who are we trying to help?", "what do they care about?", "who are all the relevant stakeholders?", and bring them into our development process. This informs the prioritisation of our model improvements work, and the development of our comprehensive evaluation benchmarks. These further inform model improvements (and each other) through a fast automatic evaluations-based and a slower human evaluations-based iteration loop. Finally, we use the deployment of our models to real users to further inform our research and development work, and to feed back into the participation stage. We use this approach to develop LearnLM-Tutor, a conversational AI tutor. Evaluation (teacher preferences): one of seven evaluation benchmarks introduced in this report. It shows that educators prefer LearnLM-Tutor over prompted \cite{mollick2023assigning} base Gemini 1.0 on the majority of measured pedagogical attributes. Deployment (ASU Study Hall): example conversation between LearnLM-Tutor and an ASU Study Hall student enrolled in the Introduction to Programming course. Participation (learner feedback): an interview quote from an ASU Study Hall student who has used LearnLM-Tutor during their course. We use interviews to get qualitative feedback on the efficacy and safety of the tutor.
Figure 2: Overview of the evaluation taxonomy introduced in Section \ref{sec:pedagogy_taxonomy} that underpins the seven pedagogical evaluation benchmarks introduced in this report. Each benchmark is unique in its place within the taxonomy and comes with its own benefits and challenges. Together, these different benchmarks provide a more comprehensive view on the pedagogical capabilities of gen AI tutors. Numbers in brackets represent section numbers describing each particular benchmark.
Figure 3: Left: illustration of the arguments made in Section \ref{sec:operationalising_pedagogy}. Hypothetically all pedagogical behaviour can be visualised as a complex manifold lying within a high-dimensional space of all possible learning contexts (e.g. subject type, learner preferences) and pedagogical strategies and interventions (some of which may only be available in certain contexts). Only small parts of this manifold may be considered as optimal pedagogy, and such areas are hard to discover due to the complexity of the search space. Right: no ideal dataset exists for pedagogy, so we experimented with a mixture of datasets, each covering a small slice of pedagogical contexts and strategies, each with its own strengths and weaknesses, each involving varying levels of human input and effort, and each being an imperfect (to varying degrees) approximation of what may be considered as good pedagogy (see Section \ref{sec:our_sft_datasets} for more details).
Figure 4: Welch's t-test (with Holm-Bonferroni adjustment) effect sizes comparing the learner scores between Gemini 1.0 ($n=33$) and LearnLM-Tutor ($n=27$). Dark indicates significance ($p<0.05$).
Figure 5: Welch's t-test effect sizes (with Holm-Bonferroni adjustment) comparing the turn-level expert rater scores evaluating the pedagogical quality of Gemini 1.0 and LearnLM-Tutor across different pedagogy dimensions. Dark indicates significance ($p<0.05$). See Section \ref{sec:per_turn_pedagogical_ratings_supplementary} for details on what each pedagogical dimension refers to and the tutor turn counts used in these calculations.
...and 16 more figures
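
The captions for Figures 4 and 5 refer to Welch's t-test with a Holm-Bonferroni adjustment. As a rough illustration of those two standard procedures, and not the authors' analysis code, the following minimal sketch computes the Welch t statistic with its Welch-Satterthwaite degrees of freedom and applies the Holm step-down adjustment to a list of p-values. The function names `welch_t` and `holm_adjust` are hypothetical, the sketch does not compute p-values or the effect sizes the figures report, and any scores passed in would be placeholders rather than data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Unlike Student's t-test, the two groups may have unequal
    variances and unequal sample sizes.
    """
    # Squared standard error of each group mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

In this sketch, a dimension would be marked significant (the "dark" bars in the captions) when its adjusted p-value falls below 0.05. Obtaining the raw p-value from `t` and `df` requires the t-distribution CDF, which is left to a statistics library.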
