PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback

Nils Wandel, David Stotko, Alexander Schier, Reinhard Klein

TL;DR

Grading Jupyter notebook assignments in STEM courses and giving feedback on them is labor-intensive and slow. PyEvalAI is a privacy-preserving, open-source system that combines unit tests with a locally hosted inference engine to score notebooks automatically and deliver feedback, while keeping tutors in control of final grades. In a case study from a numerics course, AI feedback aligns with human grading in the majority of cases (182/277, 65.7%), with a mean difference of $-0.14\%$ and a standard deviation of $20.73\%$. Feedback takes about 88.2 seconds per submission, which allows rapid repeated attempts and notable student improvement. The work demonstrates the practical value of local inference for scalable, privacy-respecting educational tooling and outlines avenues for broader adoption, enhanced prompting strategies, and future comparison of local inference configurations.
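The agreement figures above can be reproduced from paired scores with a few lines of Python. The following is a minimal sketch, not the authors' evaluation code: the function name `agreement_summary`, the toy scores, the sign convention (AI score minus tutor score), the use of exact equality as "agreement", and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def agreement_summary(ai_scores, tutor_scores, max_points):
    """Compare AI-proposed and tutor-assigned scores for the same submissions.

    Hypothetical helper: differences are AI minus tutor, expressed as a
    percentage of the maximum achievable points (an assumed convention).
    """
    if len(ai_scores) != len(tutor_scores) or not ai_scores:
        raise ValueError("need two non-empty score lists of equal length")

    # Signed difference per submission, in percent of the maximum score.
    diffs = [
        100.0 * (ai - tutor) / max_points
        for ai, tutor in zip(ai_scores, tutor_scores)
    ]
    # Count submissions where both graders gave exactly the same score.
    exact = sum(1 for d in diffs if d == 0.0)

    return {
        "n": len(diffs),
        "exact_matches": exact,
        "agreement_rate": exact / len(diffs),
        "mean_diff_pct": mean(diffs),
        "std_diff_pct": pstdev(diffs),  # population standard deviation
    }


# Toy data, invented for illustration only (not from the case study).
ai = [4, 3, 5, 2]
tutor = [4, 4, 5, 1]
summary = agreement_summary(ai, tutor, max_points=5)

# Sanity check of the ratio quoted in the summary: 182 of 277 cases.
reported_rate = round(100 * 182 / 277, 1)
```

A mean difference close to zero combined with a large standard deviation, as reported above, indicates that the AI scores are not systematically higher or lower than the tutors' scores, but that individual submissions can still deviate substantially in either direction.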

Abstract

Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this delay in feedback prevents iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX and Python code, or exclusion of course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system that automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free and open-source, and it ensures tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.

Paper Structure

This paper contains 19 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Architecture of PyEvalAI. Section \ref{sec:user_interface} describes the front-end; Section \ref{sec:backend} provides details about the back-end.
  • Figure 2: Example of a Jupyter notebook for students. Block 1: log in and enter the numerics course. Block 2: exercise description. Block 3: student solution. Block 4: hand in the exercise and obtain feedback from PyEvalAI. For best clarity, please view this figure digitally and zoom in as needed.
  • Figure 3: Jupyter notebook for admins. Block 1: log in and enter the course. Block 2: specify the task, solution, and unit tests, and register the exercise. Block 3: remove the exercise.
  • Figure 4: Students can easily view all exercises, the scores achieved so far, the corresponding deadlines, and the number of attempts.
  • Figure 5: Tutors can monitor in real time, in a table, all grades achieved by the students for the individual assignments. By clicking on a grade, tutors can access further details and fix incorrect grades (see Figure \ref{fig:exercise_tutor}).
  • ...and 5 more figures