Let Guidelines Guide You: A Prescriptive Guideline-Centered Data Annotation Methodology
Federico Ruggeri, Eleonora Misino, Arianna Muti, Katerina Korre, Paolo Torroni, Alberto Barrón-Cedeño
TL;DR
The Guideline-Centered Annotation Methodology (GCAM) addresses key flaws of standard annotation by tying each data sample to explicit guideline fragments rather than to fixed class labels, which makes guideline adherence transparent and lets annotations be reused across tasks. It formalizes a two-stage process: annotators first map samples to guideline subsets, and a separate grounding function then translates those guidelines into classes, which supports multiple tasks and richer model evaluation. Results from a human annotation study and machine learning experiments show that GCAM achieves annotation quality comparable to the standard annotation methodology (SAM) while offering deeper insight into guideline adherence and model alignment. Encoder-based models generally perform well, whereas LLMs struggle to identify the appropriate guidelines for a sample. The approach promises improved data quality and error analysis, and the data and code are released to support reproducibility and broader adoption across annotation paradigms.
Abstract
We introduce the Guideline-Centered Annotation Methodology (GCAM), a novel data annotation methodology designed to report the annotation guidelines associated with each data sample. Our approach addresses three key limitations of the standard prescriptive annotation methodology by reducing information loss during annotation and ensuring adherence to guidelines. Furthermore, GCAM enables the efficient reuse of annotated data across multiple tasks. We evaluate GCAM in two ways: (i) a human annotation study and (ii) an experimental evaluation with several machine learning models. Our results highlight the advantages of GCAM from multiple perspectives, demonstrating its potential to improve annotation quality and error analysis.
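The two-stage idea summarized above (samples are annotated with guideline subsets, and a separate grounding function maps those subsets to classes) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the guideline identifiers (`g1`, `g2`, `g3`), the sample identifiers, the category names, and both grounding functions are hypothetical and are not taken from the paper's data or code.

```python
from typing import Callable, Dict, FrozenSet, List, TypeVar

GuidelineSet = FrozenSet[str]
Label = TypeVar("Label")

# Stage 1 (hypothetical data): annotators attach guideline subsets to samples.
# No class label is stored at this point.
annotations: Dict[str, GuidelineSet] = {
    "sample_1": frozenset({"g1", "g3"}),
    "sample_2": frozenset(),            # no guideline applies
    "sample_3": frozenset({"g2"}),
}

# Stage 2, task A (assumed grounding): a sample is "positive" if any
# guideline applies to it, otherwise "negative".
def binary_grounding(guidelines: GuidelineSet) -> str:
    return "positive" if guidelines else "negative"

# Stage 2, task B (assumed grounding): each guideline belongs to a
# hypothetical fine-grained category; a sample receives every category
# covered by its guidelines.
GUIDELINE_TO_CATEGORY: Dict[str, str] = {
    "g1": "category_a",
    "g2": "category_b",
    "g3": "category_a",
}

def category_grounding(guidelines: GuidelineSet) -> List[str]:
    return sorted({GUIDELINE_TO_CATEGORY[g] for g in guidelines})

def ground(
    annotated: Dict[str, GuidelineSet],
    grounding: Callable[[GuidelineSet], Label],
) -> Dict[str, Label]:
    """Apply a grounding function to every annotated sample."""
    return {sample_id: grounding(gs) for sample_id, gs in annotated.items()}

# The same guideline-level annotations serve two tasks without re-annotation.
binary_labels = ground(annotations, binary_grounding)
category_labels = ground(annotations, category_grounding)

print(binary_labels)
print(category_labels)
```

Because the class labels are derived rather than stored, changing a task definition in this sketch only requires swapping the grounding function; the guideline-level annotations remain untouched.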
