The StudyChat Dataset: Analyzing Student Dialogues With ChatGPT in an Artificial Intelligence Course

Hunter McNichols, Fareya Ikram, Andrew Lan

TL;DR

Students who prompt LLMs for conceptual understanding and coding help tend to perform better on assignments and exams, whereas students who use LLMs to write reports and circumvent assignment learning objectives have lower exam outcomes than others.

Abstract

The widespread availability of large language models (LLMs), such as ChatGPT, has significantly impacted education, raising both opportunities and challenges. Students can interact frequently with LLM-powered interactive learning tools, but their usage patterns need to be observed and understood. We introduce StudyChat, a publicly available dataset capturing real-world student interactions with an LLM-powered tutoring chatbot in a semester-long, university-level artificial intelligence (AI) course. We deploy a web application that replicates ChatGPT's core functionalities, and use it to log student interactions with the LLM while they work on programming assignments. We collect 16,851 interactions, which we annotate using a dialogue act labeling schema inspired by observed interaction patterns and prior research. We analyze these interactions, highlight usage trends, and examine how specific student behaviors correlate with course outcomes. We find that students who prompt LLMs for conceptual understanding and coding help tend to perform better on assignments and exams. Moreover, students who use LLMs to write reports and circumvent assignment learning objectives have lower outcomes on exams than others. StudyChat serves as a shared resource to facilitate further research on the evolving role of LLMs in education.

Paper Structure

This paper contains 17 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An overview of the dataset collection and analysis pipeline presented in our work.
  • Figure 2: The web application used by students in this study. The user interface mirrors the functionality of ChatGPT.
  • Figure 3: Average assignment scores (LLM-supported) vs. exam scores (No-LLM) for Fall 2024 (left, circles) and Spring 2025 (right, squares), grouped by interaction level: bottom 10% (red), middle 80% (blue), and top 10% (green). Grey dashed line shows equal outcome, and red line shows the data trend for each semester; top users cluster at higher scores, and bottom users show a wider spread and more low outliers across both semesters.
  • Figure 4: PCA of dialogue act (DA) features for Fall (top-left) and Spring (top-right). The table (bottom) summarizes the cluster distributions and shows the percentage of specific dialogue acts for the top 10 most common specific DAs across all clusters. We see the type of question and writing requests as consistent distinguishing features across clusters. See appendix for further cluster details.
  • Figure 5: Comparison of dialogue act PCA clusters from Fall 2024. The radar plot (top) shows the top 10 most prominent specific dialogue acts, with values indicating the mean percentage of that act type within each cluster. The table (bottom) reports the number of students in each cluster and their mean aggregate course scores. Exam scores of cluster 3 are lower on average than those of other clusters.
  • ...and 2 more figures
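
The Figure 3 caption groups students by interaction level into the bottom 10%, middle 80%, and top 10%. A minimal sketch of one way to form such groups is shown below. It assumes a simple rank-based split on per-student interaction counts; the paper's exact thresholding rule is not stated here, and the function name, student ids, and counts are hypothetical.

```python
from typing import Dict


def group_by_interaction_level(counts: Dict[str, int], tail: float = 0.10) -> Dict[str, str]:
    """Label each student 'bottom', 'middle', or 'top' by interaction-count rank.

    Assumption: each tail holds floor(n * tail) students; ties are broken by id.
    """
    ranked = sorted(counts, key=lambda s: (counts[s], s))  # ascending by count
    n = len(ranked)
    k = int(n * tail)  # number of students in each tail
    labels = {}
    for i, student in enumerate(ranked):
        if i < k:
            labels[student] = "bottom"
        elif i >= n - k:
            labels[student] = "top"
        else:
            labels[student] = "middle"
    return labels


# Hypothetical interaction counts for 10 students (not taken from the dataset)
counts = {f"s{i}": c for i, c in enumerate([3, 40, 12, 250, 7, 55, 90, 18, 61, 33])}
levels = group_by_interaction_level(counts)
```

With these made-up counts, one student falls in each tail and the remaining eight are labeled as the middle group. A quantile-based cutoff on the counts would be an equally plausible reading of the caption.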