Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study

Wenhan Lyu, Yimeng Wang, Tingting Chung, Yifan Sun, Yixuan Zhang

TL;DR

This semester-long field study evaluates CodeTutor, an LLM-powered tutoring tool, in an introductory CS course with 50 undergraduates. Using a between-subjects design, the authors compare CodeTutor users with a traditional TA-supported control group and analyze learning outcomes, attitudes, and engagement in relation to prompt quality and prior experience with LLM-powered tools. CodeTutor users achieved statistically significant score improvements, with the largest gains among students new to LLM tools. Over the semester, students agreed less with CodeTutor's suggestions, increasingly preferred human TAs, and raised concerns about the tool's limited role in developing critical thinking. The quality of user prompts was significantly correlated with the effectiveness of CodeTutor's responses. The work underlines the need for Generative AI literacy in curricula, proposes design considerations for AI-assisted learning, and highlights the temporal dynamics and alignment challenges of LLM-based tutoring in CS education.

Abstract

The integration of AI assistants, especially through the development of Large Language Models (LLMs), into computer science education has sparked significant debate. An emerging body of work has looked into using LLMs in education, but few have examined the impacts of LLMs on students in entry-level programming courses, particularly in real-world contexts and over extended periods. To address this research gap, we conducted a semester-long, between-subjects study with 50 students using CodeTutor, an LLM-powered assistant developed by our research team. Our study results show that students who used CodeTutor (the experimental group) achieved statistically significant improvements in their final scores compared to peers who did not use the tool (the control group). Within the experimental group, those without prior experience with LLM-powered tools demonstrated significantly greater performance gain than their counterparts. We also found that students expressed positive feedback regarding CodeTutor's capability, though they also had concerns about CodeTutor's limited role in developing critical thinking skills. Over the semester, students' agreement with CodeTutor's suggestions decreased, with a growing preference for support from traditional human teaching assistants. Our analysis further reveals that the quality of user prompts was significantly correlated with CodeTutor's response effectiveness. Building upon our results, we discuss the implications of our findings for integrating Generative AI literacy into curricula to foster critical thinking skills and turn to examining the temporal dynamics of user engagement with LLM-powered tools. We further discuss the discrepancy between the anticipated functions of tools and students' actual capabilities, which sheds light on the need for tailored strategies to improve educational outcomes.

Paper Structure (31 sections, 1 equation, 6 figures, 4 tables)

Introduction
Related Work
Intelligent Tutoring Systems
Large Language Models in CS Education
Method
Design of CodeTutor
Participants
Study Procedure & Data Collection
Pre-test
Control vs. Experimental Group
Student Evaluation
Data Analysis
Quantitative Data Analysis
Qualitative Data Analysis
Results
...and 16 more sections

Figures (6)

Figure 1: CodeTutor is a web application that leverages the OpenAI API and features four main components: Conversation History, which lists different conversation threads; Main Conversation, which shows an ongoing dialogue with CodeTutor; a Conversation-level Feedback module, which allows users to elaborate on their attitudes towards CodeTutor by providing ratings on 1) comprehension, 2) critical thinking, 3) syntax mastery, 4) independent learning, and 5) TA replacement likelihood, and to provide specific comments; and Message-level Feedback, which offers options for users to give detailed feedback on individual messages or responses from CodeTutor.
Figure 2: Parametric pairwise comparison (ANOVA) reveals no significant difference in correct answer count of pre-test in the control and experimental groups.
Figure 3: Parametric pairwise comparison (ANOVA) reveals a significantly higher mean score in the "CodeTutor-Allowed" group compared to the "CodeTutor-Not-Allowed" group.
Figure 4: Participants’ attitudes toward CodeTutor, in terms of comprehension, critical thinking, syntax mastery, independent learning, and TA replacement (see Figure 1 for detailed questions).
Figure 5: A correlation matrix heatmap visualizing the relationship between different metrics. The blue color indicates positive correlations, while pink represents negative correlations. Correlation coefficients are displayed inside each cell.
...and 1 more figures
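
The captions for Figures 2 and 3 describe ANOVA comparisons between the control and experimental groups. As a rough illustration of what such a comparison computes, the sketch below calculates a one-way ANOVA F statistic using only the Python standard library. The function name `one_way_anova_f` and the score lists are hypothetical and made up for illustration; they are not the study's data or the authors' analysis code, and the p-value lookup against the F distribution is omitted.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic scores for illustration only (NOT data from the study).
control = [70.0, 72.0, 68.0, 75.0, 71.0]
experimental = [78.0, 80.0, 76.0, 82.0, 79.0]

f_stat, df_b, df_w = one_way_anova_f([control, experimental])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With only two groups, the F statistic equals the square of the independent-samples t statistic, so this comparison is equivalent to a two-sample t-test with pooled variance.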
