CSEPrompts: A Benchmark of Introductory Computer Science Prompts

Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Christian Newman, Tharindu Ranasinghe, Marcos Zampieri

TL;DR

CSEPrompts introduces a reproducible benchmark for introductory CS tasks comprising 269 prompts: 219 coding prompts (118 from coding websites and 101 from MOOCs), each with at least five test cases, and 50 multiple-choice questions (MCQs) from MOOCs. The study evaluates eight state-of-the-art LLMs on Python code generation and MCQ answering, using uniform prompts and pytest-based evaluation, and reports metrics such as Pass@1 for code tasks. Key findings: MOOC prompts are more challenging than coding-site prompts; GPT-3.5 often leads overall; code LLMs excel at coding tasks, whereas raw LLMs tend to fare better on MCQs. The work highlights the potential and limits of current LLMs in CS education and sets a path for broader future evaluations, including more prompts and additional models.
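As a rough illustration of the Pass@1 metric mentioned above, the sketch below assumes the pass@k estimator commonly reported with HumanEval-style benchmarks; the summary does not state how the authors compute it. The helper names (`passes_all`, `pass_at_k`, `benchmark_pass_at_1`) are hypothetical, and plain Python comparisons stand in for the pytest-based checks.

```python
from math import comb


def passes_all(candidate, test_cases) -> bool:
    """Return True only if the candidate matches every (args, expected) pair.

    Hypothetical stand-in for a pytest run: any exception counts as a failure.
    """
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass all tests.

    Assumed estimator: 1 - C(n - c, k) / C(n, k). For k = 1 this reduces to c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results) -> float:
    """Average pass@1 over problems, given a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With a single generated solution per prompt (n = 1), Pass@1 is simply the fraction of prompts whose solution passes all of its test cases.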

Abstract

Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and professional purposes. Schools and universities are aware of the increasing use of AI-generated content by students and they have been researching the impact of this new technology and its potential misuse. Educational programs in Computer Science (CS) and related fields are particularly affected because LLMs are also capable of generating programming code in various programming languages. To help understand the potential impact of publicly available LLMs in CS education, we introduce CSEPrompts, a framework with hundreds of programming exercise prompts and multiple-choice questions retrieved from introductory CS and programming courses. We also provide experimental results on CSEPrompts to evaluate the performance of several LLMs with respect to generating Python code and answering basic computer science and programming questions.

Paper Structure

This paper contains 16 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Sample Prompt for Coding Tasks.
  • Figure 2: Sample Prompt for MCQs.
  • Figure 3: Comparing CSEPrompts with HumanEval and MBPP based on Pass@1.
  • Figure 4: Comparing CSEPrompts-MCQ with MathQA based on Zero Shot Prompting (in percentage).