Automatic Generation of Question Hints for Mathematics Problems using Large Language Models in Educational Technology

Junior Cedric Tonga, Benjamin Clement, Pierre-Yves Oudeyer

TL;DR

LLMs (GPT-4o and Llama-3-8B-Instruct) act as teachers that generate hints for LLM-simulated students (GPT-3.5-turbo, Llama-3-8B-Instruct, or Mistral-7B-Instruct-v0.3) tackling math exercises that were designed for human high-school students using cognitive science principles.

Abstract

The automatic generation of hints by Large Language Models (LLMs) within Intelligent Tutoring Systems (ITSs) has shown potential to enhance student learning. However, generating pedagogically sound hints that address student misconceptions and adhere to specific educational objectives remains challenging. This work explores using LLMs (GPT-4o and Llama-3-8B-Instruct) as teachers to generate effective hints for students simulated through LLMs (GPT-3.5-turbo, Llama-3-8B-Instruct, or Mistral-7B-Instruct-v0.3) tackling math exercises that were designed for human high-school students using cognitive science principles. We study three dimensions: 1) identifying error patterns made by simulated students on secondary-level math exercises; 2) developing various prompts for GPT-4o as a teacher and evaluating their effectiveness in generating hints that enable simulated students to self-correct; and 3) testing the best-performing prompts, selected for their ability to produce relevant hints and facilitate error correction, with Llama-3-8B-Instruct as the teacher, allowing for a performance comparison with GPT-4o. The results show that model errors increase with higher temperature settings. Notably, when hints are generated by GPT-4o, the most effective prompts include those tailored to specific errors as well as those providing general hints based on common mathematical errors. Interestingly, Llama-3-8B-Instruct as a teacher showed better overall performance than GPT-4o. In addition, the problem-solving and response-revision capabilities of the LLMs as students, particularly GPT-3.5-turbo, improved significantly after receiving hints, especially at lower temperature settings. However, models like Mistral-7B-Instruct demonstrated a decline in performance as the temperature increased.
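The student-teacher loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Exercise`, `Student`, `Teacher`, and `run_hint_loop` are hypothetical, the student and teacher are plain callables standing in for LLM calls, and answers are graded by exact string match, which the paper does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Exercise:
    statement: str
    correct_answer: str


# Hypothetical signatures; any LLM client could be wrapped to match them.
Student = Callable[[str, Optional[str]], str]  # (statement, hint or None) -> answer
Teacher = Callable[[Exercise, str], str]       # (exercise, wrong answer) -> hint question


def run_hint_loop(
    exercises: Sequence[Exercise], student: Student, teacher: Teacher
) -> Tuple[float, float]:
    """Return (accuracy before hints, accuracy after hints)."""
    if not exercises:
        raise ValueError("at least one exercise is required")
    correct_before = 0
    correct_after = 0
    for ex in exercises:
        answer = student(ex.statement, None)
        # Assumption: exact string match stands in for answer grading.
        if answer.strip() == ex.correct_answer:
            correct_before += 1
            correct_after += 1
            continue
        # Only incorrect answers are sent to the teacher for a hint.
        hint = teacher(ex, answer)
        revised = student(ex.statement, hint)
        if revised.strip() == ex.correct_answer:
            correct_after += 1
    n = len(exercises)
    return correct_before / n, correct_after / n
```

Keeping the student and teacher as injected callables makes it straightforward to swap models or hint-generation prompts, and to rerun the same loop at different temperature settings, while comparing accuracy before and after hints.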

Paper Structure

This paper contains 46 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: General approach: An LLM acting as a student solves a math exercise and provides its answer and reasoning. If the answer is incorrect, the exercise details (instructions, cognitive approach, correct answer, exercise statement) are passed, with or without the LLM student's answer and reasoning (depending on the hint-generation prompt type), to another LLM acting as a teacher. The LLM teacher model generates a hint in the form of a question using different hint-generation prompts. The hint is then provided to the same LLM student model so that it can revise its response.
  • Figure 2: Comparison of accuracy before and after providing hints across four exercises for different student models, using GPT-4o and Llama-3-8B-Instruct as teacher models with the best specialized hint-generation prompt focused on calculation errors. The results show improved performance when using Llama-3-8B-Instruct as the teacher model.
  • Figure 3: Comparison of accuracy before and after providing hints across four exercises for different student models, using GPT-4o and Llama-3-8B-Instruct as teacher models with the best baseline-type hint-generation prompt, named BaselineTwo. The results show improved performance when using Llama-3-8B-Instruct as the teacher model.
  • Figure 4: Comparison of accuracy before and after providing hints across each exercise for different student models, using GPT-4o and Llama-3-8B-Instruct as teacher models with the best specialized hint-generation prompt focused on calculation errors. The results show improved performance when using Llama-3-8B-Instruct as the teacher model, but the student models struggled to correct themselves on exercise 3 - module 7.
  • Figure 5: Comparison of accuracy before and after providing hints across each exercise for different student models, using GPT-4o and Llama-3-8B-Instruct as teacher models with the best baseline-type hint-generation prompt, named BaselineTwo. The results show improved performance when using Llama-3-8B-Instruct as the teacher model, but the student models struggled to correct themselves on exercise 3 - module 7.
  • ...and 7 more figures