HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding

Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, Hanan Salam

TL;DR

HAI-Eval addresses the gap in evaluating human-AI collaboration in coding by introducing a collaboration-necessary benchmark with a problem-template bank, dynamic task instantiation, an ecologically valid cloud IDE, and a reproducible LLM-evaluation toolkit. The framework demonstrates that standalone LLMs and unaided developers underperform on complex, context-rich tasks, while human-AI teams achieve substantial gains through co-reasoning, challenging the traditional tool-centric view of AI in software engineering. Key contributions include a rigorous design grounded in ecological validity and necessary collaboration, extensive empirical validation with a within-subject study, and open-source infrastructure to benchmark future models and assess core developer competencies in the AI era. The work highlights a paradigm shift toward collaborative problem solving where strategic breakthroughs can originate from either humans or AI, and sets the stage for more nuanced evaluation of autonomous coding agents and developer skills. The practical impact lies in providing a scalable, realistic framework to quantify human value and guide the development of next-generation coding assistants.

Abstract

LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems where success depends on human-AI collaboration. Such collaborative problems require human reasoning to interpret complex contexts and guide solution strategies, and they also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.
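
The abstract mentions 45 templates that dynamically create tasks and a toolkit with 450 task instances, which is consistent with ten instances per template. The following is a minimal sketch of how seeded template instantiation could work in principle. It is not HAI-Eval's actual implementation: the names `ProblemTemplate`, `instantiate`, and `build_task_set`, the parameter-range format, and the ten-instances-per-template split are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class ProblemTemplate:
    """Hypothetical template: a statement with placeholders plus integer ranges."""
    template_id: str
    statement: Template
    parameter_ranges: dict  # placeholder name -> (low, high), inclusive


def instantiate(template: ProblemTemplate, seed: int) -> dict:
    """Create one concrete task; the same (template, seed) pair gives the same task."""
    # A string seed keeps instances reproducible across runs and machines.
    rng = random.Random(f"{template.template_id}:{seed}")
    params = {
        name: rng.randint(low, high)
        for name, (low, high) in template.parameter_ranges.items()
    }
    return {
        "task_id": f"{template.template_id}-{seed}",
        "params": params,
        "statement": template.statement.substitute(params),
    }


def build_task_set(templates, instances_per_template: int = 10) -> list:
    """Instantiate every template the same number of times (assumed split)."""
    return [
        instantiate(template, seed)
        for template in templates
        for seed in range(instances_per_template)
    ]


# Toy stand-ins for the 45 templates; real templates are far richer.
toy_templates = [
    ProblemTemplate(
        template_id=f"T{i:02d}",
        statement=Template("Process $n records within a budget of $k operations."),
        parameter_ranges={"n": (10, 1000), "k": (1, 50)},
    )
    for i in range(45)
]
tasks = build_task_set(toy_templates)
```

Seeding each instance from the template identifier and an index is one way to keep an LLM evaluation reproducible while still varying the concrete task text between instances.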

Paper Structure

This paper contains 101 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: HAI-Eval provides two interfaces for evaluation, underscoring its two contributions to the community. The chart displays the performance improvement by human-AI collaboration.
  • Figure 2: The overall architecture of HAI-Eval.
  • Figure 2: Performance comparison of 4 conditions across difficulties. The final row shows the Averaged Overall Pass@1 across 3 difficulties.
  • Figure 3: The design-validation pipeline for transforming algorithmic cores into templates.
  • Figure 4: Visualization of key participant feedback. Details of feedback statistics are provided in Appendix \ref{apx:detailed_human_data}.
  • ...and 7 more figures
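
One of the captions above refers to an Averaged Overall Pass@1 across 3 difficulties. The exact aggregation is not specified on this page, so the sketch below assumes an unweighted mean of per-difficulty Pass@1 values, where Pass@1 is the fraction of tasks solved on a single attempt. The function names and the difficulty labels `easy`, `medium`, and `hard` are hypothetical.

```python
from collections import defaultdict


def pass_at_1(outcomes) -> float:
    """Fraction of single-attempt outcomes that passed."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def averaged_overall_pass_at_1(results):
    """Return per-difficulty Pass@1 and their unweighted mean (assumed aggregation).

    `results` is an iterable of (difficulty, passed) pairs, one per task attempt.
    """
    by_difficulty = defaultdict(list)
    for difficulty, passed in results:
        by_difficulty[difficulty].append(passed)
    per_difficulty = {d: pass_at_1(v) for d, v in by_difficulty.items()}
    overall = sum(per_difficulty.values()) / len(per_difficulty)
    return per_difficulty, overall


# Made-up outcomes for illustration only; these are not results from the paper.
example_results = [
    ("easy", True), ("easy", False),
    ("medium", False), ("medium", False),
    ("hard", True), ("hard", True),
]
per_difficulty, overall = averaged_overall_pass_at_1(example_results)
```

An unweighted mean gives each difficulty level equal influence regardless of how many tasks it contains; a task-weighted mean would be the alternative if the levels differ in size.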