PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback
Nils Wandel, David Stotko, Alexander Schier, Reinhard Klein
TL;DR
Grading Jupyter notebook assignments in STEM courses and providing feedback on them is labor-intensive and slow. PyEvalAI is a privacy-preserving, open-source system that combines unit tests with a locally hosted inference engine to automatically score notebooks and deliver feedback, while keeping tutors in control of final grades. In a case study from a numerics course, AI feedback aligned with human grading in the majority of cases (182/277, 65.7%), with a mean difference of $-0.14\%$ and a standard deviation of $20.73\%$. Feedback took about 88.2 seconds per submission, which enabled rapid repeated attempts and notable student improvement. The work demonstrates the practical value of local inference for scalable, privacy-respecting educational tooling and outlines avenues for broader adoption, enhanced prompting strategies, and future comparison of local inference configurations.
Abstract
Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this delay in feedback prevents iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX, and Python code, or the exclusion of course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system that automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free and open-source, and it ensures that tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.
