StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis

Xinyi Song; Lina Lee; Kexin Xie; Xueying Liu; Xinwei Deng; Yili Hong

TL;DR

StatLLM addresses the lack of benchmarks for evaluating LLM-generated statistical code (e.g., SAS and R) by introducing an open-source dataset with three components: statistical analysis tasks, LLM-generated SAS code, and expert human evaluation scores. It collects 65 CSV datasets that yield 207 tasks, each paired with human-verified SAS code, together with evaluation criteria for assessing the correctness, executability, and output quality of code generated by GPT-3.5, GPT-4.0, and Llama 3.1. The study analyzes correlations between automatic NLP metrics (e.g., BLEU, ROUGE, METEOR, CodeBERTScore, chrF, Jaccard) and human judgments, finds them only moderate, and shows that ML-based metrics (notably an XGBoost model) can better predict human scores, with the best model achieving a correlation of about 0.43. It also illustrates practical uses, including ensemble approaches, cross-language extensions, and an AI-powered R Shiny app for end-to-end automated statistical analysis, underscoring StatLLM’s potential to advance AI-assisted statistics and reproducible research.
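The comparison between automatic metrics and human judgments can be pictured with a minimal sketch. It computes a token-level Jaccard similarity between a reference SAS program and several generated programs, then correlates the similarities with human scores. Everything concrete here is an assumption made for illustration: the SAS snippets and the scores are invented and are not drawn from StatLLM, the regex tokenizer is a simple stand-in, and Pearson correlation is used only because the summary does not say which correlation coefficient the study reports.

```python
import re
from statistics import mean


def tokenize(code):
    """Lowercase the code and return its set of word-like tokens (assumed tokenizer)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", code.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty programs are treated as identical
    return len(ta & tb) / len(ta | tb)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: NOT taken from the StatLLM dataset.
reference = "proc means data=cars mean std; var mpg; run;"
generated = [
    "proc means data=cars mean std; var mpg; run;",
    "proc means data=cars; var mpg weight; run;",
    "proc print data=cars; run;",
]
human_scores = [5.0, 3.5, 1.0]  # invented expert ratings, one per generated program

similarities = [jaccard(reference, g) for g in generated]
r = pearson(similarities, human_scores)

for sim, score in zip(similarities, human_scores):
    print(f"jaccard={sim:.2f}  human={score}")
print(f"pearson r = {r:.3f}")
```

A token-overlap metric like this rewards surface similarity to the reference, so it can miss code that is correct but written differently. That limitation is consistent with the moderate correlations the summary reports.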

Abstract

The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the accuracy of code generated by LLMs. A major challenge in this evaluation lies in the absence of a benchmark dataset for statistical code (e.g., SAS and R). To fill this gap, this paper introduces StatLLM, an open-source dataset for evaluating the performance of LLMs in statistical analysis. The StatLLM dataset comprises three key components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores. The first component includes statistical analysis tasks spanning a variety of analyses and datasets, providing problem descriptions, dataset details, and human-verified SAS code. The second component features SAS code generated by ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1 for those tasks. The third component contains scores from human experts who assessed the correctness, effectiveness, readability, executability, and output accuracy of the LLM-generated code. We also illustrate the unique potential of the established benchmark dataset for (1) evaluating and enhancing natural language processing metrics, (2) assessing and improving LLM performance in statistical coding, and (3) developing and testing next-generation statistical software, advancements that are crucial for data science and machine learning research.
