A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages

Alessio Buscemi

TL;DR

This study evaluates ChatGPT 3.5's ability to generate runnable code across 10 languages and 4 domains using 40 tasks, highlighting substantial non-determinism and language-dependent performance. It employs a controlled API-based setup with a fixed prompting strategy, analyzing executability, time to generate, and code length across languages and task categories. Key findings show Julia achieving the highest executable rate while C++ performs poorly, with high-level languages generally more amenable to code generation than low-level ones; the study also notes ethical and operational limitations affecting outputs. The work discusses implications for language evolution, industry adoption, and the need for standardized, multi-language benchmarking to fairly assess LLM-assisted code generation and guide future research and policy. The results suggest LLMs could disrupt software development workflows, driving efficiency while necessitating reskilling and ethical governance.

Abstract

Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training on large datasets in order to understand and produce language that closely resembles that of humans. These models have reached a level of proficiency where they are capable of successfully completing university exams across several disciplines and generating functional code to handle novel problems. This research investigates the coding proficiency of ChatGPT 3.5, an LLM released by OpenAI in November 2022, which has gained significant recognition for its impressive text generation and code creation capabilities. The skill of the model in creating code snippets is evaluated across 10 programming languages and 4 software domains. Based on the findings of this research, major unexpected behaviors and limitations of the model have been identified. This study aims to identify potential areas for development and examine the ramifications of automated code generation on the evolution of programming languages and on the tech industry.

Paper Structure

This paper contains 21 sections, 1 equation, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Status of the output generated by ChatGPT for the 4,000 tests, grouped by programming language and category.
  • Figure 2: $P_{\ell}$ of each language.
  • Figure 3: Mean Coefficient of Variation of ChatGPT's response time across all tasks, divided by language.
  • Figure 4: $LoC_{\ell}$ and $NoC_{\ell}$ of each language.
  • Figure 5: Mean Coefficient of Variation of Lines of Code and Number of Codes produced by ChatGPT across all tasks, divided by language.
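
Figures 3 and 5 report a mean Coefficient of Variation per language. As a rough illustration of what such a statistic measures, the sketch below computes the coefficient of variation (standard deviation divided by the mean) over repeated measurements of each task and then averages it across the tasks of each language. The function names, the nested-dictionary layout, the use of the population standard deviation, and the sample numbers are all assumptions made for this example; they are not taken from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean (assumed definition)."""
    mu = mean(samples)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(samples) / mu


def mean_cv_by_language(measurements):
    """Average the per-task coefficient of variation for each language.

    `measurements` maps language -> task -> list of repeated measurements
    (a hypothetical layout chosen for this sketch).
    """
    return {
        language: mean(coefficient_of_variation(runs) for runs in tasks.values())
        for language, tasks in measurements.items()
    }


# Made-up response times in seconds, for illustration only.
example = {
    "Python": {"task_a": [2.0, 2.0, 2.0], "task_b": [1.0, 3.0]},
    "C++": {"task_a": [4.0, 6.0]},
}
result = mean_cv_by_language(example)
```

A value close to 0 means the repeated runs of a task were nearly identical, while larger values indicate more run-to-run variation, which is the sense in which the study speaks of non-determinism.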