HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems

Jun Xing, Mayur Bhatia, Sahil Phulwani, Darshan Suresh, Rafik Matta

TL;DR

The paper presents HackerRank-ASTRA, a project-based, multi-file benchmark for evaluating large language models in real-world software tasks, with a focus on frontend technologies. It introduces metrics for correctness and consistency, including Mean Score, Mean Pass@1, and median SD across $k=32$ runs, applied to 65 frontend problems spanning 10 skill domains. Results show mean performance around 70% with similar averages across top models, while Claude-3.5-Sonnet-1022 demonstrates the highest consistency ($\mathrm{SD}=0.0497$); formatting and guardrail behavior significantly affect outcomes, indicating practical considerations for deployment. Overall, ASTRA provides a realistic, repeatable framework to assess LLM reliability in software development and highlights areas for improvement in formatting, end-to-end integration, and skill-specific capabilities.

Abstract

Evaluating the real-world applicability of large language models (LLMs) provides valuable insights for their development and use in software development tasks. Existing benchmarks often focus on standalone coding problems or specific libraries, overlooking multi-file, project-based scenarios and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency through 32 runs (k = 32) and median standard deviation while incorporating taxonomy-level analysis to assess sub-skill capabilities. Initial evaluations on 65 problems show that the top three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved comparable average scores of 75%, with no statistically significant differences in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest consistency across problems, with low variability (SD = 0.0497), which was statistically significant compared to other models, highlighting its reliability for real-world software development tasks.
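
The correctness and consistency metrics named above (mean score, mean pass@1, and median SD over k runs) can be illustrated with a minimal sketch. It rests on assumptions not confirmed by the abstract: that each run yields a score in [0, 1] (for example, the fraction of test cases passed), that pass@1 counts a run as passing only when its score is exactly 1.0, and that the population standard deviation is used. The function name `summarize` and the input layout are hypothetical.

```python
from statistics import mean, median, pstdev


def summarize(scores_by_problem):
    """Aggregate per-run scores into benchmark-level metrics.

    scores_by_problem maps a problem id to the list of scores from its
    k runs (k = 32 in the paper). Each score is assumed to lie in [0, 1].
    """
    per_problem_mean = []
    per_problem_pass = []
    per_problem_sd = []
    for runs in scores_by_problem.values():
        # Average score of this problem across its runs.
        per_problem_mean.append(mean(runs))
        # Assumption: a run "passes" only with a perfect score.
        per_problem_pass.append(mean(1.0 if s == 1.0 else 0.0 for s in runs))
        # Assumption: population SD; the paper may use the sample SD.
        per_problem_sd.append(pstdev(runs))
    return {
        "mean_score": mean(per_problem_mean),
        "mean_pass_at_1": mean(per_problem_pass),
        # Consistency: median across problems of the per-problem SD.
        "median_sd": median(per_problem_sd),
    }


if __name__ == "__main__":
    # Toy data with k = 2 runs per problem, for illustration only.
    toy = {"problem_a": [1.0, 1.0], "problem_b": [0.5, 1.0]}
    print(summarize(toy))
```

Under these assumptions, a lower median SD means a model's scores vary less from run to run on a typical problem, which is the sense in which the abstract speaks of consistency.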

Paper Structure

This paper contains 29 sections, 2 equations, 25 figures, 6 tables.

Figures (25)

  • Figure 1: Distribution of v1 HackerRank-ASTRA benchmark main skill frequency.
  • Figure 2: Distribution of v1 HackerRank-ASTRA benchmark sub-skill frequency.
  • Figure 3: Project structure of a sample RESTful API problem.
  • Figure 4: Diagram of v1 HackerRank-ASTRA benchmark evaluation pipeline.
  • Figure 5: XML prompt of v1 HackerRank-ASTRA benchmark.
  • ...and 20 more figures