Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Feng Gu, Zongxia Li, Carlos Rafael Colon, Benjamin Evans, Ishani Mondal, Jordan Lee Boyd-Graber

TL;DR

This paper investigates a holistic, cross-document event annotation workflow and evaluates large language models (LLMs) as assistants rather than independent annotators. By pairing embedding-based similarity for event-set clustering with prompt-driven LLM classification and segmentation, the study shows that LLMs outperform TF-IDF baselines in event-set curation and can meaningfully reduce downstream variable coding time when used in hybrid human–machine setups. Key findings include the embedding method achieving high precision ($0.89$) and the llm+seg method improving recall and overall $F_1$ compared to alternatives, with near-human agreement observed for LM-extracted variables in hybrid conditions. Together, these results offer a practical path toward scalable, high-quality event annotation while highlighting current limitations and the need for careful integration strategies in real-world pipelines.

Abstract

Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods for Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs extract event variables to assist expert annotators, the annotators agree more with the extracted variables than with fully automated LLM annotation.
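
The "merge documents about the same event" step, and the precision, recall, and $F_1$ figures used to score it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy two-dimensional vectors stand in for real document embeddings, the `0.8` threshold and single-link grouping are arbitrary choices, and the pairwise scoring scheme is one common way to evaluate clusterings, not necessarily the one the paper uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_by_similarity(vectors, threshold=0.8):
    """Single-link grouping: documents whose cosine similarity reaches the
    (hypothetical) threshold end up in the same event set."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(vectors))]


def pairwise_prf(pred, gold):
    """Pairwise precision, recall, and F1: a document pair counts as positive
    when both documents share an event set."""
    pairs = list(combinations(range(len(gold)), 2))
    pred_pos = {p for p in pairs if pred[p[0]] == pred[p[1]]}
    gold_pos = {p for p in pairs if gold[p[0]] == gold[p[1]]}
    tp = len(pred_pos & gold_pos)
    precision = tp / len(pred_pos) if pred_pos else 0.0
    recall = tp / len(gold_pos) if gold_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (invented): two documents per event, well separated in vector space.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_gold = [0, 0, 1, 1]
toy_pred = cluster_by_similarity(toy_vectors)
print(pairwise_prf(toy_pred, toy_gold))
```

In a hybrid setup, groupings like `toy_pred` would be shown to an expert as suggestions to confirm or correct rather than accepted as final event sets.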

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Our workflow for annotating event data begins with preprocessing incoming news media. A Support Vector Machine identifies highly relevant documents for manual review. During Event Set Curation, human annotators create unique event sets. Finally, annotators code the domain-specific variables. We apply LM-based similarity indices and use LM-extracted variables to aid manual processing.
  • Figure 2: The agreement between the manual and the automated settings is comparable. On average, annotators and the LM agree 50% of the time using NM. PEDANTS and BERT show higher agreement. The difference between human-human and human-LM agreement is not statistically significant, suggesting that the LM-extracted variables provide approximately human-level utility.
  • Figure 3: Agreement by event set type and setting. Annotators show higher agreement in the hybrid setting, where extracted variables are available. This indicates that these variables help code the events. Furthermore, the extracted variables prove particularly beneficial in LM-generated incident sets, which often contain misinformation.
  • Figure 4: Agreement grouped by variable type. Human annotators agree more with extracted variables that have a higher degree of specificity. Country has over 90% agreement. Generic attack type and weapon type also show high agreement. In comparison, low-specificity variables like location demonstrate low agreement with human judgment.
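
The per-variable agreement that Figure 4 describes can be sketched as a simple grouped match rate. This is only an illustration under stated assumptions: the captions do not define how matches are scored, so the `normalize` helper (lowercasing and stripping punctuation before comparing strings) is a hypothetical stand-in, and the records are invented toy values rather than data from the paper.

```python
import re
from collections import defaultdict


def normalize(text):
    """Hypothetical normalization: lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def agreement_by_variable(records):
    """Fraction of records per variable where the human-coded value and the
    LM-extracted value match after normalization.

    records: iterable of (variable_name, human_value, lm_value) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for variable, human_value, lm_value in records:
        totals[variable] += 1
        if normalize(human_value) == normalize(lm_value):
            hits[variable] += 1
    return {variable: hits[variable] / totals[variable] for variable in totals}


# Invented toy records: a specific variable (country) and a vaguer one (location).
toy_records = [
    ("country", "Iraq", "iraq."),
    ("country", "Kenya", "Kenya"),
    ("location", "near the market, Baghdad", "Baghdad"),
    ("location", "Nairobi", "nairobi"),
]
toy_agreement = agreement_by_variable(toy_records)
print(toy_agreement)
```

Strict string matching of this kind penalizes free-text variables such as location, where two reasonable answers can differ in granularity. A softer matcher would score such pairs more leniently.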