Performance Evaluation of Large Language Models in Statistical Programming

Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong

TL;DR

This paper addresses the systematic evaluation of large language models for statistical programming in SAS by building a human-evaluated benchmark across $207$ tasks and three LLMs (GPT-3.5, GPT-4.0, Llama 3.1 70B). It introduces a three-group, ten-criterion rating rubric and a robust rating pipeline with $9$ expert raters, yielding $18{,}630$ scores and bootstrap-based inferences. The findings show LLMs can produce syntactically valid SAS code, but challenges remain in code executability and output correctness, with no model consistently outperforming others across all criteria. The work provides a data-driven framework and insights to guide future AI-assisted statistical coding, along with open data for benchmarking and metric development.
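The counts above can be cross-checked with a little arithmetic, and the phrase "bootstrap-based inferences" can be made concrete with a small sketch. The code below is illustrative only and is not the authors' procedure: it assumes a percentile bootstrap that resamples tasks, and it runs on synthetic scores drawn from an arbitrary range. The function name `bootstrap_mean_ci`, the score range, and the seeds are all hypothetical choices made for this example.

```python
import random

# Counts reported in the summary above.
N_TASKS, N_MODELS, N_CRITERIA, TOTAL_SCORES = 207, 3, 10, 18_630

# Scores per (task, model, criterion) cell implied by those counts.
scores_per_cell, remainder = divmod(TOTAL_SCORES, N_TASKS * N_MODELS * N_CRITERIA)
print(scores_per_cell, remainder)  # 3 0


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int(n_boot * alpha / 2)]
    upper = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# SYNTHETIC per-task total scores for one model (assumption: the range
# 20..50 is arbitrary and does not come from the paper).
data_rng = random.Random(42)
synthetic_totals = [data_rng.randint(20, 50) for _ in range(N_TASKS)]

lower, upper = bootstrap_mean_ci(synthetic_totals)
sample_mean = sum(synthetic_totals) / len(synthetic_totals)
print(f"mean={sample_mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The division comes out even: $207 \times 3 \times 10 = 6{,}210$ and $18{,}630 / 6{,}210 = 3$, which is consistent with each generated program receiving three independent ratings per criterion. How the $9$ raters were assigned to tasks is not stated in this summary.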

Abstract

The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of the generated code need to be systematically evaluated before it can be widely adopted. Despite the growing prominence of LLMs, comprehensive evaluations of the statistical code they generate remain scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs are useful for generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Study design illustration showing the evaluation setup (a) and the rating procedure (b). The LLMs under investigation are GPT-3.5, GPT-4.0, and Llama 3.1 70B. The standard SAS code refers to the human-verified SAS code, and the $x_{i,j,k}^{M}$ are rating scores.
  • Figure 2: Examples of data description, problem descriptions, and human-verified SAS code.
  • Figure 3: Plot of the mean total score with error bars representing the SD for the three LLMs (a), and histograms displaying the distributions of total scores (b).
  • Figure 4: Plot of the mean group scores with error bars representing the SD for the three LLMs. Note that the score ranges for each group are different.
  • Figure 5: Radar plot of the means of individual criterion scores among three LLMs.
  • ...and 3 more figures