Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants

Md Sultanul Islam Ovi, Nafisa Anjum, Tasmina Haque Bithe, Md. Mahabubur Rahman, Mst. Shahnaj Akter Smrity

TL;DR

The paper examines how leading AI-driven programming assistants perform on competitive programming tasks using a controlled LeetCode benchmark. By evaluating 300 problems across three difficulty levels with consistent tool settings and two phases (problem solving and debugging), it shows that GitHub Copilot excels on easy and medium problems, ChatGPT offers superior memory efficiency and debugging capability, and Codeium struggles with complex tasks. The findings quantify tool strengths and limitations, informing developers and researchers about when and how to integrate AI assistants into coding workflows. Overall, while all tools boost productivity on simpler tasks, they do not yet outperform humans on the hardest problems, highlighting the need for careful tooling choices and future improvements in AI-assisted programming.

Abstract

With the increasing adoption of AI-driven tools in software development, large language models (LLMs) have become essential for tasks like code generation, bug fixing, and optimization. Tools like ChatGPT, GitHub Copilot, and Codeium provide valuable assistance in solving programming challenges, yet their effectiveness remains underexplored. This paper presents a comparative study of ChatGPT, Codeium, and GitHub Copilot, evaluating their performance on LeetCode problems across varying difficulty levels and categories. Key metrics such as success rates, runtime efficiency, memory usage, and error-handling capabilities are assessed. GitHub Copilot showed superior performance on easier and medium tasks, while ChatGPT excelled in memory efficiency and debugging. Codeium, though promising, struggled with more complex problems. Despite their strengths, all tools faced challenges in handling harder problems. These insights provide a deeper understanding of each tool's capabilities and limitations, offering guidance for developers and researchers seeking to optimize AI integration in coding workflows.

Paper Structure

This paper contains 19 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Distribution of our dataset (300 LeetCode Problems) by Difficulty.
  • Figure 2: Distribution of the dataset problems by topic and difficulty. The dataset is evenly distributed across difficulties (easy, medium, hard) within each topic.
  • Figure 3: This figure illustrates the input format for the code generation task. A base prompt is passed to the AI model, which includes the problem definition, examples, constraints, and code structure. These instructions provide the necessary context for the model to understand how the task needs to be solved. Finally, a query is passed to the model, instructing it to generate the Python code solution based on the given context.
  • Figure 4: This figure presents the input structure for the debugging task, which is similar to the code generation input. The key difference lies in the prompt: for debugging, the previous erroneous code and the corresponding output are also provided. The context still includes the problem statement, examples, and constraints, but the model is asked to fix the errors and generate the correct Python solution.
  • Figure 5: Acceptance rates for Users, ChatGPT, Codeium, and Copilot across different difficulty levels.
  • ...and 3 more figures
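
The captions for Figures 3 and 4 describe two prompt layouts that share one context (problem definition, examples, constraints, code structure) and differ only in the final instruction and in the extra debugging material. A minimal Python sketch of that layout follows. The class name, function names, section labels, and query wording are all hypothetical, chosen for illustration; the paper's exact prompt text is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    """Hypothetical container for the context fields named in Figures 3 and 4."""
    definition: str
    examples: List[str]
    constraints: List[str]
    code_structure: str  # starter signature the solution must fill in


def build_context(problem: Problem) -> str:
    """Join the shared context: definition, examples, constraints, code structure."""
    parts = [
        "Problem:\n" + problem.definition,
        "Examples:\n" + "\n".join(problem.examples),
        "Constraints:\n" + "\n".join(problem.constraints),
        "Code structure:\n" + problem.code_structure,
    ]
    return "\n\n".join(parts)


def build_generation_prompt(problem: Problem) -> str:
    """Code generation input: shared context followed by a query."""
    # Assumed wording; the paper's actual query string is not given in this section.
    query = "Generate a Python solution based on the context above."
    return build_context(problem) + "\n\n" + query


def build_debugging_prompt(problem: Problem, previous_code: str,
                           previous_output: str) -> str:
    """Debugging input: same context, plus the failed code and its output."""
    parts = [
        build_context(problem),
        "Previous code:\n" + previous_code,
        "Previous output:\n" + previous_output,
        # Assumed wording, as above.
        "Fix the errors and generate a correct Python solution.",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = Problem(
        definition="Return the sum of two integers.",
        examples=["Input: a = 1, b = 2, Output: 3"],
        constraints=["-100 <= a, b <= 100"],
        code_structure="def add(a: int, b: int) -> int:",
    )
    print(build_generation_prompt(demo))
```

Keeping the shared context in one helper mirrors the captions' point that the debugging input is the generation input plus the erroneous code and its output, so any difference between the two phases comes from that added material rather than from a change in how the problem is presented.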