Understanding the Role of Temperature in Diverse Question Generation by GPT-4

Arav Agarwal; Karthik Mittal; Aidan Doyle; Pragnya Sridhar; Zipiao Wan; Jacob Arthur Doughty; Jaromir Savelka; Majd Sakr

TL;DR

This paper investigates how GPT-4's temperature parameter affects the diversity of automatically generated multiple-choice questions (MCQs). A pipeline conditioned on learning objectives (LOs) and course/module information generates MCQs for $52$ LOs at three temperatures ($0.2$, $1.0$, $1.2$). Instructors provided $313$ annotations rating the generated questions on two criteria, Q1-distinct and Q2-complete, with inter-rater reliability of $\kappa = 0.30472$ (Fleiss's $\kappa$). Temperature $0.2$ yields more duplicates, while $1.0$ and $1.2$ yield more distinct questions; the difference between $0.2$ and the two higher settings is significant, but the difference between $1.0$ and $1.2$ is not. Qualitative analysis indicates that LOs at higher levels of Bloom's Taxonomy produce fewer duplicates, while low-level LOs remain harder to diversify. These findings support using temperatures of $1.0$--$1.2$ for diverse MCQ generation and suggest directions for prompting and diversity-promoting strategies.
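For readers unfamiliar with the reliability statistic mentioned above, the following is a minimal sketch of the textbook Fleiss's $\kappa$ computation. It is not the authors' analysis code; the function name `fleiss_kappa` and the toy rating counts are assumptions made purely for illustration.

```python
def fleiss_kappa(counts):
    """Textbook Fleiss's kappa.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement among rater pairs, per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 items, 3 raters, categories (yes, no).
toy_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(fleiss_kappa(toy_counts))
```

Values near $1$ indicate strong agreement and values near $0$ indicate agreement no better than chance, which gives context for the reported $\kappa = 0.30472$.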

Abstract

We conduct a preliminary study of the effect of GPT-4's temperature parameter on the diversity of GPT-4-generated questions. We find that using higher temperature values leads to significantly higher diversity, with different temperatures exposing different types of similarity between generated sets of questions. We also demonstrate that diverse question generation is especially difficult for questions targeting lower levels of Bloom's Taxonomy.
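As background on why temperature affects diversity, the sketch below shows the standard temperature-scaled softmax used in sampling from language models. How GPT-4 applies temperature internally is not described here, so this is an illustration under that assumption; the logits are made-up values and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for three candidate next tokens.
example_logits = [2.0, 1.0, 0.5]

# The three temperatures studied in the paper.
for t in (0.2, 1.0, 1.2):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At $0.2$ almost all probability mass sits on the top candidate, so repeated generations tend to repeat; at $1.0$ and $1.2$ the distribution is flatter, which is consistent with the higher diversity reported.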

Paper Structure (5 sections, 1 table)

Introduction
Methodology
Results
Discussion
Future Work
