Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust

Patrick Diehl; Noujoud Nader; Steve Brandt; Hartmut Kaiser

TL;DR

This study evaluates ChatGPT 3.5 and 4.0 as AI code generators across nine languages for three scientific tasks: numerical integration (NI), a conjugate gradient solver (CGS), and a parallel 1D stencil heat equation solver (PHS). It systematically assesses compilation, runtime, and accuracy, revealing that both models can produce compilable code, but correctness varies by language and task, with parallel codes being especially error-prone. The authors quantify code quality using lines of code (LOC) and a COCOMO-based metric, finding that languages like Matlab and Python often yield smaller code and higher quality, while C++ and Java tend to be more robust across tasks. The results inform practitioners about current limitations and guide future research on prompting strategies, parallel programming support, and GPU/distributed code generation for AI-assisted HPC coding.

Abstract

This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes, even the simple example we chose to study here, were also difficult for the AI to generate correctly.
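
To give a sense of the scale of the tasks named above, the following is a minimal sketch of a conjugate gradient solver in Python. It is not taken from the paper or from any ChatGPT output. The function names, the tolerance, the iteration limit, and the 2x2 test system are all assumptions chosen for illustration, and the sketch assumes a symmetric positive-definite matrix stored as a list of rows.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b, assuming A is symmetric positive-definite.

    tol and max_iter are illustrative defaults, not values from the paper.
    """
    x = [0.0] * len(b)      # initial guess x = 0
    r = list(b)             # residual b - A x, which equals b for x = 0
    p = list(r)             # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:   # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                       # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                            # direction update
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical 2x2 test system; its exact solution is x = (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

Checking the result against a known solution, as in the small system above, mirrors the accuracy criterion the study applies to the generated codes.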

Paper Structure (7 sections, 1 equation, 3 figures, 3 tables)

Introduction
Related Work
Methodology
Quality of the generated software
Common issues
Code metrics
Discussion and Conclusion

Figures (3)

Figure 1: Software engineering metrics for the numerical integration example: (a) lines of code for all implementations, using the maximal lines of code, with the numbers determined by the Linux tool cloc; (b) two-dimensional classification using the computational time and the COCOMO model.
Figure 2: Software engineering metrics for the conjugate gradient solver: (a) lines of code for all implementations, using the maximal lines of code, with the numbers determined by the Linux tool cloc; (b) two-dimensional classification using the computational time and the COCOMO model.
Figure 3: Software engineering metrics for the parallel heat equation solver: (a) lines of code for all implementations, using the maximal lines of code, with the numbers determined by the Linux tool cloc; (b) two-dimensional classification using the computational time and the COCOMO model.
