Can LLMs Replace Manual Annotation of Software Engineering Artifacts?

Toufique Ahmed, Premkumar Devanbu, Christoph Treude, Michael Pradel

TL;DR

This work investigates whether large language models (LLMs) can safely replace portions of human annotation in software engineering evaluations, addressing the cost and generalizability challenges of human-subject studies. By applying six state-of-the-art LLMs to ten annotation tasks across five datasets, the study finds that model-model agreement often aligns with human-model agreement and that confidence-based sample selection can reduce human effort without substantially degrading inter-rater reliability. A practical two-step workflow is proposed: first gauge task suitability via model-model agreement, then selectively replace high-confidence samples with LLM outputs. The findings offer a first step toward mixed human–LLM evaluations in SE, with clear guidelines on when and how to delegate annotation tasks to LLMs.

Abstract

Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
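
The two-step idea described above (check model-model agreement first, then hand over only high-confidence samples to an LLM) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `threshold` parameter, and the choice to replace a single human rater are assumptions made for illustration. Agreement is computed as Krippendorff's $\alpha$ for nominal labels, the statistic named in the figure captions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per sample holding the labels given by each rater
    (None marks a missing rating). Returns NaN when only one category
    occurs, since alpha is undefined in that case.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a sample with fewer than two ratings carries no pairs
        # Every ordered pair of ratings from different raters counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")
    return 1.0 - (n - 1) * observed / expected


def replace_confident(human_labels, model_labels, confidences,
                      threshold, rater_index=0):
    """Swap one human rater's label for the model's label on every sample
    whose model confidence reaches `threshold` (a hypothetical parameter).

    Returns the mixed label matrix and the number of replaced samples.
    """
    mixed, replaced = [], 0
    for labels, model_label, conf in zip(human_labels, model_labels,
                                         confidences):
        row = list(labels)
        if conf >= threshold:
            row[rater_index] = model_label
            replaced += 1
        mixed.append(row)
    return mixed, replaced


if __name__ == "__main__":
    # Toy data, invented for this sketch only.
    model_votes = [[1, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]  # step 1 input
    humans = [[1, 1], [0, 1], [1, 1], [0, 0]]
    llm_labels = [1, 0, 1, 0]
    llm_conf = [0.95, 0.55, 0.90, 0.97]

    print("model-model alpha:", krippendorff_alpha_nominal(model_votes))
    mixed, k = replace_confident(humans, llm_labels, llm_conf, threshold=0.9)
    print("replaced samples:", k)
    print("human-human alpha:", krippendorff_alpha_nominal(humans))
    print("mixed alpha:", krippendorff_alpha_nominal(mixed))
```

In such a setup, one would compare the mixed agreement against the human-only agreement and lower or raise `threshold` to trade saved human effort against reliability. Which threshold is acceptable depends on the task and is not fixed by this sketch.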

Paper Structure

This paper contains 16 sections, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Annotation tasks with different difficulty levels.
  • Figure 2: Inter-rater agreement (Krippendorff's $\alpha$) for code summarization accuracy and similarity. Results for adequacy and conciseness are similar (omitted due to space).
  • Figure 3: Inter-rater agreement for name-value inconsistencies.
  • Figure 4: Inter-rater agreement for causality.
  • Figure 5: Inter-rater agreement for semantic similarity with evaluation criteria "Goals". Other criteria ("Operations" and "Effects") give similar heatmaps (omitted due to space).
  • ...and 9 more figures