To Aggregate or Not to Aggregate. That is the Question: A Case Study on Annotation Subjectivity in Span Prediction
Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau
TL;DR
This work tackles automatic span prediction for legal problem descriptions annotated by multiple lawyers, addressing intrinsic subjectivity in legal area labeling. It introduces a subjectivity-aware evaluation framework with span- and word-level metrics and two gold-span strategies (majority-voted and best-matched), and compares two training regimes: aggregating annotations (MV) versus repeating labels (ReL). Using a neural sequence tagger with pretrained language models (BERT base uncased and DeBERTaV3) and a BiLSTM-CRF, the study finds that training on majority-voted spans generally yields higher span-level accuracy, while stronger models can modulate precision-recall trade-offs. The work highlights practical considerations for handling subjective annotations in span-based NLP tasks and discusses limitations, including data privacy and reproducibility, while offering directions for future evaluation and methodology refinement.
Abstract
This paper explores the task of automatically predicting text spans in a legal problem description that support a legal area label. We use a corpus of problem descriptions written by laypeople in English and annotated by practising lawyers. The task is inherently subjective because legal area categorisation is complex, and lawyers often hold different views on a problem, especially when the issues are described in legally imprecise terms. Experiments show that training on majority-voted spans outperforms training on disaggregated ones.
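
To make the contrast between the two training regimes concrete, the following is a minimal Python sketch of majority voting versus keeping annotations disaggregated. It is an illustration only, not the authors' implementation: the per-word binary labels, the example sentence, the strict-majority tie rule, and the function names `majority_vote` and `repeat_labels` are all assumptions introduced here.

```python
from typing import List, Tuple

Labels = List[int]  # one 0/1 label per word: 1 = word lies inside a supporting span


def majority_vote(annotations: List[Labels]) -> Labels:
    """Aggregate annotators' labels word by word.

    Assumption: a word is kept only if strictly more than half of the
    annotators marked it, so ties resolve to 0.
    """
    n_annotators = len(annotations)
    return [1 if 2 * sum(votes) > n_annotators else 0 for votes in zip(*annotations)]


def repeat_labels(words: List[str], annotations: List[Labels]) -> List[Tuple[List[str], Labels]]:
    """Keep annotations disaggregated: one training example per annotator."""
    return [(words, labels) for labels in annotations]


# Hypothetical example: three annotators label the same five-word description.
words = ["My", "landlord", "kept", "my", "bond"]
annotations = [
    [0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
]

mv_labels = majority_vote(annotations)            # a single aggregated label sequence
rel_examples = repeat_labels(words, annotations)  # three examples, one per annotator

print(mv_labels)
print(len(rel_examples))
```

Under majority voting the tagger sees one label sequence per description, whereas the disaggregated regime shows it the same description once per annotator, disagreements included.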
