OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering

Mia Mohammad Imran, Tarannum Shaila Zaman

TL;DR

The paper addresses methodological gaps in LLM-based annotation for empirical software engineering (ESE) by reframing annotation as a measurement process. It introduces OLAF, a framework that defines six measurement dimensions (reliability, calibration, drift, consensus, aggregation, and transparency) and maps them to multiple annotation configurations to improve reproducibility and transparency. It provides concrete metrics, guidelines, and a discussion of limitations to support auditable LLM-driven labeling in software engineering research. The work lays a foundation for more reliable, comparable, and open ESE studies and outlines directions for empirical validation and tool support.

Abstract

Large Language Models (LLMs) are increasingly used in empirical software engineering (ESE) to automate or assist annotation tasks such as labeling commits, issues, and qualitative artifacts. Yet the reliability and reproducibility of such annotations remain underexplored. Existing studies often lack standardized measures for reliability, calibration, and drift, and frequently omit essential configuration details. We argue that LLM-based annotation should be treated as a measurement process rather than a purely automated activity. In this position paper, we outline the Operationalization for LLM-based Annotation Framework (OLAF), a conceptual framework that organizes key constructs: reliability, calibration, drift, consensus, aggregation, and transparency. The paper aims to motivate methodological discussion and future empirical work toward more transparent and reproducible LLM-based annotation in software engineering research.
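
To make the "annotation as measurement" framing concrete, the sketch below estimates one of the named constructs, reliability, as chance-corrected agreement (Cohen's kappa) between two annotation runs over the same items. This is an illustration only: the summary above does not state which reliability metric OLAF prescribes, and the function name `cohen_kappa`, the label set, and the two runs are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels.

    Assumption: kappa is used here as one common reliability measure;
    it is not confirmed to be the metric OLAF itself recommends.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items labeled identically in both runs.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each run's label marginals.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    # Both runs used a single identical label everywhere: agreement is trivial.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two runs of the same prompt on six issues.
run_1 = ["bug", "feature", "bug", "docs", "bug", "feature"]
run_2 = ["bug", "feature", "docs", "docs", "bug", "bug"]

kappa = cohen_kappa(run_1, run_2)
print(f"kappa = {kappa:.3f}")
```

Reporting such a statistic alongside the model version, prompt, and decoding settings is the kind of configuration detail the abstract notes is frequently omitted.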

Paper Structure

This paper contains 14 sections, 2 tables.