"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Vibhor Agarwal; Madhav Krishan Garg; Sahiti Dharmavaram; Dhruv Kumar

TL;DR

This study systematically compares four publicly available LLMs (Google Bard, ChatGPT-3.5, GitHub Copilot Chat, and Microsoft Copilot) on five task domains common among undergraduate computer science students in India, combining quantitative accuracy on LeetCode-style problems with qualitative analyses of code explanations, assignments, learning, and email writing. Seven student evaluators rated model outputs on a 1-10 scale using predefined metrics, with datasets drawn from actual coursework and prompts spanning code explanation, programming, theory, humanities, learning frameworks, and communication. The key finding is that no single model dominates all tasks: Microsoft Copilot excels in code explanations and programming-oriented tasks, GitHub Copilot Chat leads in programming assignments and interview preparation, Bard is strongest for learning new concepts and frameworks, and ChatGPT performs best in drafting emails. The work offers practical guidance for students and instructors on selecting and integrating LLMs into CS education, while noting limitations such as task scope, dataset biases, and the rapidly evolving model landscape.

Abstract

This study evaluates the effectiveness of various large language models (LLMs) in performing tasks common among undergraduate computer science students. Although several studies in the computing education community have explored the use of LLMs for a variety of tasks, comprehensive research that compares different LLMs and identifies which are most effective for which tasks is lacking. Our research systematically assesses publicly available LLMs such as Google Bard, ChatGPT (3.5), GitHub Copilot Chat, and Microsoft Copilot across diverse tasks commonly encountered by undergraduate computer science students in India. These tasks include code explanation and documentation, solving class assignments, technical interview preparation, learning new concepts and frameworks, and email writing. The evaluation was carried out by pre-final-year and final-year undergraduate computer science students and provides insights into the models' strengths and limitations. This study aims to guide students and instructors in selecting a suitable LLM for a specific task and offers insights into how students and instructors can use LLMs constructively.