
Understanding the Role of Temperature in Diverse Question Generation by GPT-4

Arav Agarwal, Karthik Mittal, Aidan Doyle, Pragnya Sridhar, Zipiao Wan, Jacob Arthur Doughty, Jaromir Savelka, Majd Sakr

TL;DR

This paper investigates how GPT-4's temperature parameter affects the diversity of automatically generated multiple-choice questions (MCQs). A pipeline conditioned on learning objectives (LOs) and course/module information generates MCQs for $52$ LOs at three temperatures ($0.2$, $1.0$, $1.2$). Instructors contributed $313$ annotations on two criteria, Q1-distinct and Q2-complete, with an inter-rater reliability of Fleiss's $\kappa = 0.30472$. Temperature $0.2$ yields more duplicates, while $1.0$ and $1.2$ yield more distinct questions; the difference between $0.2$ and the two higher settings is significant, whereas no significant difference appears between $1.0$ and $1.2$. Qualitative analysis indicates that LOs at higher levels of Bloom's Taxonomy produce fewer duplicates, while low-level LOs remain harder to diversify. These findings support using temperatures in the range $1.0$--$1.2$ for diverse MCQ generation and suggest directions for prompting and other diversity-promoting strategies.
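For readers unfamiliar with the reliability statistic quoted above, the following is a minimal sketch of the standard Fleiss's $\kappa$ formula. It is not the authors' code: the function name `fleiss_kappa` and the toy count matrices are illustrative assumptions, and the study's actual rating layout is not described in this summary.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes every item was rated by the same number of raters (>= 2)
    and that the ratings are not all in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Observed agreement for each item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 2 items, 3 raters, 2 categories (e.g. distinct / duplicate).
print(fleiss_kappa([[3, 0], [0, 3]]))   # raters agree on every item
print(fleiss_kappa([[2, 1], [1, 2]]))   # raters split on every item
```

Values near 1 indicate strong agreement and values near 0 indicate agreement close to chance, which is the scale on which the reported $\kappa = 0.30472$ should be read.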

Abstract

We conduct a preliminary study of the effect of GPT's temperature parameter on the diversity of GPT4-generated questions. We find that using higher temperature values leads to significantly higher diversity, with different temperatures exposing different types of similarity between generated sets of questions. We also demonstrate that diverse question generation is especially difficult for questions targeting lower levels of Bloom's Taxonomy.
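As background on why temperature influences diversity, the sketch below shows the usual way temperature rescales a model's token scores before sampling. It is a generic illustration, not the paper's pipeline or GPT-4's internals: the function `softmax_with_temperature` and the three logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

# The three temperatures compared in the study.
for t in (0.2, 1.0, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass sits on the top-scoring token, so repeated generations tend to coincide; higher temperatures flatten the distribution and let alternative tokens be sampled more often, which is consistent with the duplicate pattern the study reports.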

Paper Structure (5 sections, 1 table)

This paper contains 5 sections, 1 table.