StackEval: Benchmarking LLMs in Coding Assistance

Nidhish Shah, Zulkuf Genc, Dogu Araci

TL;DR

StackEval and StackUnseen offer a rigorous, multi-language suite for evaluating LLMs on coding tasks, including writing, debugging, code review, and conceptual understanding. The paper also introduces a robust evaluation framework where LLMs act as judges, leveraging reference answers and, optionally, chain-of-thought reasoning to measure alignment with human experts. Key findings show that reference-based evaluation improves reliability, while generalization to emergent content remains challenging for current models. By making datasets and an interactive leaderboard public, the work aims to drive progress in AI-assisted coding and establish reproducible, scalable benchmarks. The study also analyzes self-preference biases in LLM judges, finding that grounding evaluations in high-quality references mitigates such biases. Overall, StackEval/StackUnseen provide practical, dynamic benchmarks with implications for deploying and improving AI copilots in real-world software development.

Abstract

We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contribution includes two curated datasets: StackEval, a large-scale benchmark derived from Stack Overflow questions, and StackUnseen, a dynamic benchmark featuring the most recent Stack Overflow content. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. Additionally, we assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance. To ensure reproducibility, we publicly share our datasets and evaluation code at https://github.com/ProsusAI/stack-eval.

Paper Structure

This paper contains 16 sections, 1 equation, 13 figures, 5 tables.

Figures (13)

  • Figure 1: StackEval & StackUnseen Programming Language Distribution. The questions are subdivided by programming language and question type. Languages are sampled in proportion to their popularity in the Stack Overflow Developer Survey, 2023.
  • Figure 2: Evaluation methodology for assessing LLMs on coding tasks. a) LLM-as-a-Judge benchmark (CoT + Ref. Answer) comparing LLM-$t$ (model under test) against human experts when evaluating answers from LLM-$x$ models. b) Coding assistance evaluation where LLM-$t$ generates Stack Overflow answers, scored by an LLM judge.
  • Figure 3: Left: Model performance across different coding benchmarks shows strong positive correlations (0.72-0.92). Right: Model performance across different question types within the benchmark is very highly correlated (0.91-1.00), suggesting consistent performance across task categories.
  • Figure 4: Model Performance Degradation on Recent Problems. LLMs with higher StackEval scores show smaller acceptance rate drops on StackUnseen, suggesting better generalization to contemporary problems.
  • Figure 5: The performance of various LLMs across different question types on the StackEval benchmark, averaged across programming languages.
  • ...and 8 more figures
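
The acceptance-rate drop referred to in Figure 4 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the text above: the integer judge scores, the acceptance threshold of 2, the function names, and the score lists (which are made-up values, not results from the paper).

```python
# Hypothetical threshold: a judge score at or above this counts as "accepted".
# The value and the score scale are assumptions for this sketch.
ACCEPT_THRESHOLD = 2


def acceptance_rate(scores, threshold=ACCEPT_THRESHOLD):
    """Return the fraction of judge scores at or above the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    accepted = sum(1 for s in scores if s >= threshold)
    return accepted / len(scores)


def acceptance_drop(older_scores, recent_scores, threshold=ACCEPT_THRESHOLD):
    """Return how much the acceptance rate falls from older to recent questions."""
    return acceptance_rate(older_scores, threshold) - acceptance_rate(
        recent_scores, threshold
    )


# Made-up judge scores for one model on each benchmark (illustrative only).
stackeval_scores = [3, 2, 2, 1, 3, 0, 2, 3]
stackunseen_scores = [2, 1, 0, 3, 1, 2, 1, 0]

print(acceptance_rate(stackeval_scores))
print(acceptance_rate(stackunseen_scores))
print(acceptance_drop(stackeval_scores, stackunseen_scores))
```

A positive drop means the model's answers were accepted less often on the more recent questions; comparing this value across models is one way to read the trend the caption describes.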