
AI Meets Mathematics Education: A Case Study on Supporting an Instructor in a Large Mathematics Class with Context-Aware AI

Jérémy Barghorn, Anna Sotnikova, Sacha Friedli, Antoine Bosselut

Abstract

Large-enrollment university courses face persistent challenges in providing timely and scalable instructional support. While generative AI holds promise, its effective use depends on reliability and pedagogical alignment. We present a human-centered case study of AI-assisted support in a Calculus I course, implemented in close collaboration with the course instructor. We developed a system to answer students' questions on a discussion forum, fine-tuning a lightweight language model on 2,588 historical student-instructor interactions. The model achieved 75.3% accuracy on a benchmark of 150 representative questions annotated by five instructors, and in 36% of cases, its responses were rated equal to or better than instructor answers. A post-deployment student survey (N = 105) indicated that students valued the responses' alignment with the course materials and their immediate availability, while still relying on instructor verification for trust. We highlight the importance of hybrid human-AI workflows for safe and effective course support.
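To make the fine-tuning step concrete, the sketch below shows one standard way to run supervised fine-tuning of a small causal language model on (context, question, answer) triples like those in the paper's dataset. This is not the authors' implementation: the file name `qa_pairs.jsonl`, its field names, the prompt template, the choice of base model, and the hyperparameters are all assumptions for illustration; only the general recipe (Hugging Face `transformers` + `datasets` with a causal-LM collator) is standard.

```python
# Minimal sketch (assumptions, not the paper's code): supervised fine-tuning of
# a lightweight causal LM on historical student-instructor Q&A pairs, stored in
# a hypothetical JSONL file with "context", "question", and "answer" fields.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Qwen 2.5 appears among the base models in Figure 4; the exact size used in
# the paper is not stated here, so this checkpoint is an assumption.
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def format_example(ex):
    # Concatenate course context, student question, and instructor answer into
    # one training string; the prompt template here is purely illustrative.
    text = (f"Context: {ex['context']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")
dataset = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-calculus-qa",
                           per_device_train_batch_size=2,
                           num_train_epochs=3,
                           learning_rate=2e-5),
    train_dataset=dataset,
    # mlm=False makes the collator build next-token labels for causal LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Training directly on instructor answers, rather than on a generic math corpus, is what ties the model's responses to the course's notation and materials; Figure 4 contrasts this with models fine-tuned on an external math dataset.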

Paper Structure

This paper contains 41 sections, 20 figures, and 5 tables.

Figures (20)

  • Figure 1: Example from the dataset, showing the context, student question, instructor answer, and our model response.
  • Figure 2: Survey layout showing the information presented to instructors (student question, context, instructor answer, and model response) and the evaluation criteria used for annotation.
  • Figure 3: Human annotation results for the dataset of 150 questions on (a) Correctness, (b) Relevance, and (c) Completeness. 95% confidence intervals are computed via non-parametric bootstrap with 10,000 resamples (a sketch of this procedure follows this list).
  • Figure 4: Evaluation of base and fine-tuned models on 40 test questions by a single expert. Results are broken down by question difficulty: 10 easy, 20 medium, and 10 hard questions. Base models: Llama 3.1, Mathstral, Qwen 2.5, Gemma 3, and DeepSeek R1 Distill. Models suffixed "-QA" are fine-tuned on our dataset; those suffixed "OpenMath220k" are fine-tuned on that math dataset. Scores shown are averages. 95% confidence intervals are computed via non-parametric bootstrap with 10,000 resamples.
  • Figure 5: Distribution of model response alignment across question types for a set of 150 Calculus I questions. Each question type (e.g., Explanation, Hint, Clarification) is labeled with its proportion in the dataset. Bar colors represent the degree of alignment between the model's response and the intent of the student's question: Misaligned, Partially aligned, or Fully aligned.
  • ...and 15 more figures
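The confidence intervals in Figures 3 and 4 come from a standard non-parametric bootstrap with 10,000 resamples. The sketch below shows one common way to compute such a percentile interval for a mean score; the function name, the seed, and the example data are assumptions (the ratings are a hypothetical 0/1 correctness vector matching the paper's reported 75.3% accuracy on 150 questions), not the authors' code or data.

```python
# Minimal sketch (assumption, not the paper's code): percentile bootstrap
# 95% CI for a mean rating, with 10,000 resamples as reported in the captions.
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Hypothetical example: 113 correct out of 150 annotated questions (75.3%).
ratings = np.array([1] * 113 + [0] * 37)
mean, (lo, hi) = bootstrap_ci(ratings)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Because the bootstrap makes no distributional assumptions, it is well suited to the bounded, discrete annotation scores used in these evaluations.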