HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems
Jun Xing, Mayur Bhatia, Sahil Phulwani, Darshan Suresh, Rafik Matta
TL;DR
The paper presents HackerRank-ASTRA, a project-based, multi-file benchmark for evaluating large language models on real-world software tasks, with a focus on frontend technologies. It introduces metrics for correctness and consistency (Mean Score, Mean Pass@1, and median SD across $k=32$ runs) and applies them to 65 frontend problems spanning 10 skill domains. The top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) reach comparable average scores of about 75%, with no statistically significant differences among them, while Claude-3.5-Sonnet-1022 shows the highest consistency ($\mathrm{SD}=0.0497$). Formatting and guardrail behavior significantly affect outcomes, which matters in practice for deployment. Overall, ASTRA provides a realistic, repeatable framework for assessing LLM reliability in software development and highlights areas for improvement in formatting, end-to-end integration, and skill-specific capabilities.
Abstract
Evaluating the real-world applicability of large language models (LLMs) provides valuable insights for their development and use in software development tasks. Existing benchmarks often focus on standalone coding problems or specific libraries, overlooking multi-file, project-based scenarios and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency across 32 runs (k = 32) using the median standard deviation, and it incorporates taxonomy-level analysis to assess sub-skill capabilities. Initial evaluations on 65 problems show that the top three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved comparable average scores of 75%, with no statistically significant differences in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest consistency across problems, with low variability (SD = 0.0497); this difference from the other models was statistically significant, highlighting its reliability for real-world software development tasks.
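As an illustration of how correctness and consistency figures of this kind could be aggregated, the following is a minimal sketch rather than the benchmark's own implementation. It assumes each run of a problem yields a score in [0, 1] (for example, the fraction of test cases passed), that a run counts toward Pass@1 only when its score is exactly 1.0, and that the per-problem SD is the population standard deviation over the k runs. The function name `summarize` and the input layout are hypothetical.

```python
from statistics import mean, median, pstdev


def summarize(scores_by_problem):
    """Aggregate per-run scores into correctness and consistency metrics.

    scores_by_problem maps a problem id to the list of scores from its
    k runs, each score assumed to lie in [0, 1].
    """
    per_problem_mean = []
    per_problem_pass1 = []
    per_problem_sd = []
    for runs in scores_by_problem.values():
        # Correctness: average score over the k runs of this problem.
        per_problem_mean.append(mean(runs))
        # Pass@1 (assumed definition): share of runs with a perfect score.
        per_problem_pass1.append(mean([1.0 if s == 1.0 else 0.0 for s in runs]))
        # Consistency: spread of the scores across runs (population SD,
        # an assumption; a sample SD would be an equally plausible choice).
        per_problem_sd.append(pstdev(runs))
    return {
        "mean_score": mean(per_problem_mean),
        "mean_pass_at_1": mean(per_problem_pass1),
        # The median across problems limits the influence of a few
        # highly unstable problems.
        "median_sd": median(per_problem_sd),
    }


if __name__ == "__main__":
    # Toy data with k = 4 runs per problem, for illustration only.
    toy = {
        "problem_a": [1.0, 1.0, 0.5, 1.0],
        "problem_b": [0.5, 0.5, 0.5, 0.5],
    }
    print(summarize(toy))
```

With real data, each list would hold 32 scores. A lower median SD means the model's scores vary less from run to run on a typical problem.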
